Refresher (1) As a refresher on statistical plots, let’s build a scatter plot with an additional statistic layer.
A dataset called movies_small is available in your workspace. It is a random sample of 1000 observations from the larger movies dataset in the ggplot2movies package. The dataset contains information on movies from IMDB. The variable votes is the number of IMDB users who have rated a movie, and rating (converted into a categorical variable) is the movie’s average rating.
# Create movies_small
library(ggplot2movies)
library(ggplot2)
set.seed(123)
movies_small <- movies[sample(nrow(movies), 1000), ]
movies_small$rating <- factor(round(movies_small$rating))
# Explore movies_small with str()
str(movies_small)
Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 24 variables:
$ title : chr "Fair and Worm-er" "Shelf Life" "House: After Five Years of Living" "Three Long Years" ...
$ year : int 1946 2000 1955 2003 1963 1992 1999 1972 1994 1985 ...
$ length : int 7 4 11 76 103 107 87 84 127 94 ...
$ budget : int NA NA NA NA NA NA NA NA NA NA ...
$ rating : Factor w/ 10 levels "1","2","3","4",..: 7 7 6 8 8 5 4 8 5 5 ...
$ votes : int 16 11 15 11 103 28 105 9 37 28 ...
$ r1 : num 0 0 14.5 4.5 4.5 4.5 14.5 0 4.5 4.5 ...
$ r2 : num 0 0 0 0 4.5 0 4.5 0 4.5 0 ...
$ r3 : num 0 0 4.5 4.5 0 4.5 4.5 0 14.5 4.5 ...
$ r4 : num 0 0 4.5 0 4.5 4.5 4.5 0 4.5 14.5 ...
$ r5 : num 4.5 4.5 0 0 4.5 0 4.5 14.5 24.5 4.5 ...
$ r6 : num 4.5 24.5 34.5 4.5 4.5 0 14.5 0 4.5 14.5 ...
$ r7 : num 64.5 4.5 24.5 0 14.5 4.5 14.5 14.5 14.5 14.5 ...
$ r8 : num 14.5 24.5 4.5 4.5 14.5 24.5 14.5 24.5 14.5 14.5 ...
$ r9 : num 0 0 0 14.5 14.5 24.5 14.5 14.5 4.5 4.5 ...
$ r10 : num 14.5 24.5 14.5 44.5 44.5 24.5 14.5 44.5 4.5 24.5 ...
$ mpaa : chr "" "" "" "" ...
$ Action : int 0 0 0 0 0 0 0 0 0 0 ...
$ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
$ Comedy : int 1 0 0 1 0 1 1 1 0 0 ...
$ Drama : int 0 0 0 0 1 0 0 0 1 1 ...
$ Documentary: int 0 0 1 0 0 0 0 0 0 0 ...
$ Romance : int 0 0 0 0 0 0 1 0 0 0 ...
$ Short : int 1 1 1 0 0 0 0 0 0 0 ...
# Build a scatter plot with mean and 95% CI
ggplot(movies_small, aes(x = rating, y = votes)) +
geom_point() +
stat_summary(fun.data = "mean_cl_normal",
geom = "crossbar",
width = 0.2,
col = "red") +
scale_y_log10()

Refresher (2) The plot in the graphics device is a variation on an oft-seen ggplot2 example using the diamonds dataset (containing information on several variables of over 50,000 diamonds).
Recall that there are a variety of scale_ functions. With these, data are transformed or filtered first, after which the plot and associated statistics are computed. For example, scale_y_continuous(limits = c(100, 1000)) will remove values outside that range.
Contrast this with coord_cartesian(), which computes the statistics before applying the limits. That means that the plot and summary statistics are computed on the full, unfiltered data. That’s why we say that coord_cartesian(ylim = c(100, 1000)) “zooms in” on a plot. This was discussed in the chapter on coordinates in course 2.
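The difference can be sketched without ggplot2. The example below is a minimal base R illustration on hypothetical values (not the diamonds data): limits set on a scale discard out-of-range values before any statistic is computed, whereas a coordinate zoom leaves the data untouched.

```r
# Hypothetical toy values, chosen only for illustration
y <- c(50, 200, 400, 800, 5000)

# Scale limits: values outside c(100, 1000) are discarded first,
# so the summary only sees what is left
kept <- y[y >= 100 & y <= 1000]
mean_with_scale_limits <- mean(kept)

# Coordinate zoom: the summary is computed on all values,
# only the visible window changes
mean_with_coord_zoom <- mean(y)

mean_with_scale_limits
mean_with_coord_zoom
```

The two means differ because the smallest and largest values only contribute in the second case.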
Here we’re going to expand on this and introduce scale_x_log10() and scale_y_log10() which perform log10 transformations, and coord_equal(), which sets an aspect ratio of 1 (coord_fixed() is also an option).
Your task is to reproduce the plot in the viewer. Before you do this, it might be a good idea to explore diamonds in the console if you are not familiar with it.
# Reproduce the plot
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
geom_point(alpha = 0.5, size = 0.5, shape = 16) +
scale_x_log10(expression(log[10](Carat)), limits = c(0.1,10)) +
scale_y_log10(expression(log[10](Price)), limits = c(100,100000)) +
scale_color_brewer(palette = "YlOrRd") +
coord_equal() +
theme_classic()

Refresher (3) The goal plot from the previous exercise is coded in your editor. Here you’ll expand on this plot by showing a stat_smooth() linear model instead of every data point.
# Replace the points with a smooth layer
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
stat_smooth(method = "lm") +
scale_x_log10(expression(log[10](Carat)), limits = c(0.1,10)) +
scale_y_log10(expression(log[10](Price)), limits = c(100,100000)) +
scale_color_brewer(palette = "YlOrRd") +
coord_equal() +
theme_classic()

Transformations In this exercise you’ll return to the first plotting exercise and see how box plots compare to dot plots for representing high-density data.
Box plots are very useful, but they don’t solve all your problems all the time. For example, when your data are heavily skewed, you will still need to transform them. You’ll see that here, using the movies_small dataset, a subset of 1000 observations of ggplot2movies::movies.
# movies_small is available
# Add a boxplot geom
d <- ggplot(movies_small, aes(x = rating, y = votes)) +
geom_point() +
geom_boxplot() +
stat_summary(fun.data = "mean_cl_normal",
geom = "crossbar",
width = 0.2,
col = "red")
# Untransformed plot
d

# Transform the scale
d + scale_y_log10()

# Transform the coordinates
d + coord_trans(y = "log10")
Error in if (zero_range(range)) { : missing value where TRUE/FALSE needed
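The two log transformations above differ in when the statistics are computed: scale_y_log10() transforms the data first and then summarizes, while coord_trans() summarizes the raw data and transforms the result afterwards. A minimal base R sketch, using hypothetical vote counts rather than movies_small, shows that a mean depends on that order:

```r
# Hypothetical vote counts, for illustration only
votes <- c(10, 100, 1000)

# Transform first, then summarize (the order a scale transformation uses)
mean_of_logs <- mean(log10(votes))

# Summarize first, then transform (the order a coordinate transformation uses)
log_of_mean <- log10(mean(votes))

c(mean_of_logs = mean_of_logs, log_of_mean = log_of_mean)
```

The second value is larger because the arithmetic mean of the raw counts is pulled upward by the largest observation.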

Cut it up! If you only have continuous variables, you can convert them into ordinal variables using any of the following functions:
cut_interval(x, n) makes n groups from vector x with equal range.
cut_number(x, n) makes n groups from vector x with (approximately) equal numbers of observations.
cut_width(x, width) makes groups of width width from vector x.
This is useful when you want to summarize a complex scatter plot like the one shown in the viewer. By applying these functions to the carat variable and mapping that onto the group aesthetic, you can convert the scatter plot in the viewer into a series of box plots on the fly.
# Plot object p
p <- ggplot(diamonds, aes(x = carat, y = price))
# Use cut_interval
p + geom_boxplot(aes(group = cut_interval(carat, n=10)))

# Use cut_number
p + geom_boxplot(aes(group = cut_number(carat, n=10)))

# Use cut_width
p + geom_boxplot(aes(group = cut_width(carat, width = 0.25)))
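To see how equal-range and equal-count binning differ, here is a rough base R sketch on hypothetical skewed values. It uses cut() and quantile() as stand-ins; these are not the ggplot2 helpers themselves, and their bin labels differ.

```r
# Hypothetical skewed values: nine small numbers and one large one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Two bins of equal range: almost everything lands in the first bin
equal_range <- cut(x, breaks = 2)

# Two bins with equal numbers of observations: split at the median
equal_count <- cut(x,
                   breaks = quantile(x, probs = c(0, 0.5, 1)),
                   include.lowest = TRUE)

table(equal_range)
table(equal_count)
```

With skewed data such as carat, equal-range bins can leave some box plots with very few observations, which is one reason to compare the approaches.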

geom_density() To make a straightforward density plot, add a geom_density() layer.
Before plotting, you will calculate the empirical density function, similar to how you can use the density() function in the stats package, available by default when you start R. The following default parameters are used (you can specify these arguments both in density() as well as geom_density()):
bw = "nrd0", telling R which rule to use to choose an appropriate bandwidth.
kernel = "gaussian", telling R to use the Gaussian kernel.
We’ve already prepared a data frame test_data for you, containing three columns: norm, bimodal and uniform. Each column represents 200 samples from a normal, bimodal and uniform distribution.
rn <- rnorm(200, 0, 1)
bimodalDistFunc <- function (n,cpct, mu1, mu2, sig1, sig2) {
y0 <- rlnorm(n,mean=mu1, sd = sig1)
y1 <- rlnorm(n,mean=mu2, sd = sig2)
flag <- rbinom(n,size=1,prob=cpct)
y <- y0*(1 - flag) + y1*flag
}
bm <- bimodalDistFunc(n=200,0.4,-1,1, 1,1)
ud <- runif(200, -2, 1)
test_data <- data.frame("norm" = rn,
"bimodal" = bm,
"uniform" = ud)
head(test_data)
# test_data is available
# Calculating density: d
d <- density(test_data$norm)
# Use which.max() to calculate mode
mode <- d$x[which.max(d$y)]
# Finish the ggplot call
ggplot(test_data, aes(x = norm)) +
geom_rug() +
geom_density() +
geom_vline(xintercept = mode, col = "red")
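To make the two defaults concrete, the sketch below computes a Gaussian kernel density estimate by hand and compares its bandwidth with the one density() picks. The four data values and the helper name kde_at() are hypothetical, introduced only for this illustration.

```r
# Hypothetical sample, for illustration only
x <- c(-1, 0, 0.5, 2)

# The "nrd0" rule of thumb that density() uses by default
bw <- bw.nrd0(x)

# Gaussian kernel estimate at a single location x0:
# the average of normal densities centered on each observation
kde_at <- function(x0) mean(dnorm(x0, mean = x, sd = bw))

# density() with its defaults (bw = "nrd0", kernel = "gaussian")
d <- density(x)

c(bandwidth_from_density = d$bw, bandwidth_by_hand = bw)
kde_at(0)
```

density() evaluates the estimate on a grid using a fast approximation, so its values at a given location agree with kde_at() only approximately.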

Combine density plots and histogram Sometimes it is useful to compare a histogram with a density plot. However, the histogram’s y-scale must first be converted to density instead of absolute count, so that the total area of the bars is 1. After doing so, you can add an empirical PDF using geom_density() or a theoretical PDF using stat_function().
Can you finish the plot below by following the steps?
# test_data is available
# Arguments you'll need later on
fun_args <- list(mean = mean(test_data$norm), sd = sd(test_data$norm))
# Finish the ggplot
ggplot(test_data, aes(x = norm)) +
geom_histogram(aes(y=..density..))+
geom_density(col = "red") +
stat_function(fun = dnorm, args = fun_args, col="blue")
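The density scaling of a histogram is simply the count in each bin divided by the number of observations times the bin width. A minimal base R sketch with hypothetical values and hand-picked breaks:

```r
# Hypothetical values and bin edges, for illustration only
x <- c(0.1, 0.4, 0.6, 1.2, 1.3, 1.9, 2.5, 2.7)
h <- hist(x, breaks = c(0, 1, 2, 3), plot = FALSE)

# Density per bin: count / (n * bin width)
manual_density <- h$counts / (length(x) * diff(h$breaks))

# Matches the density that hist() reports, and the bar areas sum to 1
manual_density
sum(manual_density * diff(h$breaks))
```

Because the bars then integrate to 1, they share a y-axis with any probability density curve drawn on top.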

Adjusting density plots There are three parameters that you may be tempted to adjust in a density plot:
bw - the smoothing bandwidth to be used, see ?density for details
adjust - adjustment of the bandwidth, see ?density for details
kernel - kernel used for density estimation, defined as
"g" = gaussian
"r" = rectangular
"t" = triangular
"e" = epanechnikov
"b" = biweight
"c" = cosine
"o" = optcosine
In this exercise you’ll use a dataset containing only four points, small_data, so that you can see how these three arguments affect the shape of the density plot.
The vector get_bw contains the bandwidth that is used by default in geom_density(). p is a basic plotting object that you can start from.
# small_data is available
small_data <- data.frame("x" = c(-3.5, 0.0,0.5, 6.0))
# Get the bandwidth
get_bw <- density(small_data$x)$bw
# Basic plotting object
p <- ggplot(small_data, aes(x = x)) +
geom_rug() +
coord_cartesian(ylim = c(0,0.5))
# Create three plots
p + geom_density()

p + geom_density(adjust = 0.25)

p + geom_density(bw = 0.25 * get_bw)

# Create two plots
p + geom_density(kernel = "r")

p + geom_density(kernel = "e")
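The plots with adjust = 0.25 and bw = 0.25 * get_bw look identical because adjust is a multiplier on the bandwidth. This can be checked directly with stats::density(), here on the same four points as small_data:

```r
# The same four points as in small_data
x <- c(-3.5, 0.0, 0.5, 6.0)

# Default bandwidth chosen by the "nrd0" rule
default_bw <- density(x)$bw

# Two ways of asking for a quarter of the default bandwidth
bw_via_adjust <- density(x, adjust = 0.25)$bw
bw_explicit   <- density(x, bw = 0.25 * default_bw)$bw

c(default = default_bw, via_adjust = bw_via_adjust, explicit = bw_explicit)
```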

Box plots with varying width A drawback of showing a box plot per group is that you don’t have any indication of the sample size, n, that went into each box. One way of dealing with this is to use a variable width for the box, which reflects differences in n.
Can you add some good-looking box plots to the basic plot coded on the right?
# Finish the plot
ggplot(diamonds, aes(x = cut, y = price, col = color)) +
geom_boxplot(varwidth = TRUE) +
facet_grid(. ~ color)

Multiple density plots In this exercise you’ll combine multiple density plots. Here, you’ll combine just two distributions, a normal and a bimodal.
The first thing to remember is that you can consider values as two separate variables, like in the test_data data frame, or as a single continuous variable with their ID as a separate categorical variable, like in the test_data2 data frame. test_data2 is more convenient for combining and comparing multiple distributions.
test_data2 <- data.frame("dist" = c(rep("norm", 200), rep("bimodal", 200)),
"value" = c(test_data$norm, test_data$bimodal))
# test_data and test_data2 are available
str(test_data)
'data.frame': 200 obs. of 3 variables:
$ norm : num -0.602 -0.994 1.027 0.751 -1.509 ...
$ bimodal: num 0.986 1.232 3.668 0.414 0.094 ...
$ uniform: num 0.197 -0.171 -1.327 0.749 0.408 ...
str(test_data2)
'data.frame': 400 obs. of 2 variables:
$ dist : Factor w/ 2 levels "bimodal","norm": 2 2 2 2 2 2 2 2 2 2 ...
$ value: num -0.602 -0.994 1.027 0.751 -1.509 ...
# Plot with test_data
ggplot(test_data, aes(x = norm)) +
geom_rug()+
geom_density()

# Plot two distributions with test_data2
ggplot(test_data2, aes(x = value, fill = dist, col = dist)) +
geom_rug(alpha = 0.6) +
geom_density(alpha = 0.6)
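The step from the test_data layout to the test_data2 layout can also be done generically. The sketch below uses base R's stack() on a small hypothetical data frame; the column names value and dist are chosen to mirror test_data2.

```r
# Hypothetical wide data: one column per distribution
wide <- data.frame(norm    = c(-0.6, 1.0, 0.8),
                   bimodal = c(1.0, 1.2, 3.7))

# stack() puts all values in one column and the source column name in another
long <- stack(wide)
names(long) <- c("value", "dist")

long
```

The long form has one row per observation, which is what lets a single aesthetic such as fill = dist separate the distributions.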

Multiple density plots (2) When you looked at multiple box plots, you compared the total sleep time of various mammals, sorted according to their eating habits. One thing you noted is that for insectivores, box plots didn’t really make sense, since there were only 5 observations to begin with. You decided that you could nonetheless use the width of a box plot to show the difference in sample size between the groups. Here, you’ll see a similar thing with density plots.
A cleaned up version of the mammalian dataset is available as mammals.
head(msleep)
mammals <- msleep[,c("vore","sleep_total")]
mammals
# Individual densities
ggplot(mammals[mammals$vore == "Insecti", ], aes(x = sleep_total, fill = vore)) +
geom_density(col = NA, alpha = 0.35) +
scale_x_continuous(limits = c(0, 24)) +
coord_cartesian(ylim = c(0, 0.3))

# With faceting
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
geom_density(col = NA, alpha = 0.35) +
scale_x_continuous(limits = c(0, 24)) +
coord_cartesian(ylim = c(0, 0.3)) +
facet_wrap( ~ vore, nrow = 2)

# Note that by default, the x ranges fill the scale
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
geom_density(col = NA, alpha = 0.35) +
scale_x_continuous(limits = c(0, 24)) +
coord_cartesian(ylim = c(0, 0.3))

# Trim each density plot individually
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
geom_density(col = NA, alpha = 0.35, trim = TRUE) +
scale_x_continuous(limits=c(0,24)) +
coord_cartesian(ylim = c(0, 0.3))

Weighted density plots When plotting a single variable, the density plots (and their bandwidths) are calculated separately for each variable (see the plot from the previous exercise, provided).
However, when you compare several variables (such as eating habits) it’s useful to see the density of each subset in relation to the whole data set. This holds true for multiple density plots as well as for violin plots.
For this, we need to weight the density plots so that they’re relative to each other. Each density plot is adjusted according to what proportion of the total data set each sub-group represents. We calculate this with the dplyr commands shown below.
The mammals data frame is available as before. After executing the commands, it will have the variable n, which we’ll use for weighting.
# Unweighted density plot from before
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
geom_density(col = NA, alpha = 0.35) +
scale_x_continuous(limits = c(0, 24)) +
coord_cartesian(ylim = c(0, 0.3))

# Unweighted violin plot
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
geom_violin()

# Calculate weighting measure
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
mammals <- mammals %>%
group_by(vore) %>%
mutate(n = n() / nrow(mammals))
# Weighted density plot
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
geom_density(aes(weight = n), col = NA, alpha = 0.35) +
scale_x_continuous(limits = c(0, 24)) +
coord_cartesian(ylim = c(0, 0.3))

# Weighted violin plot
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
geom_violin(aes(weight = n), col = NA)
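The weighting variable n is each group's share of all rows. As a minimal sketch of that calculation without dplyr, the base R code below uses ave() on a hypothetical vector of diet labels (not the msleep data):

```r
# Hypothetical diet labels, for illustration only
vore <- c("carni", "carni", "herbi", "herbi", "herbi", "insecti")

# Number of rows in each observation's group, repeated per row
group_size <- ave(rep(1, length(vore)), vore, FUN = sum)

# Share of the whole data set that each row's group represents
n <- group_size / length(vore)

data.frame(vore, n)
```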

2D density plots (1) You can consider two orthogonal density plots in the form of a 2D density plot. Just like with a 1D density plot, you can adjust the bandwidth of both axes independently.
The data is stored in the faithful data frame, available in the datasets package. The object p contains the base definitions of a plot.
# Base layers
p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
scale_y_continuous(limits = c(1, 5.5), expand = c(0, 0)) +
scale_x_continuous(limits = c(40, 100), expand = c(0, 0)) +
coord_fixed(60 / 4.5)
# 1 - Use geom_density_2d()
p + geom_density_2d()

# 2 - Use stat_density_2d() with arguments
p + stat_density_2d(aes(col = ..level..), h = c(5, 0.5))
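The h argument holds one bandwidth per axis. ggplot2 computes 2D densities with MASS::kde2d(), so the same estimate can be inspected directly. The sketch below assumes the MASS package is installed (it ships with standard R distributions); the grid size n = 50 is an arbitrary choice for illustration.

```r
# Bandwidths: 5 along waiting (x), 0.5 along eruptions (y)
dens <- MASS::kde2d(faithful$waiting, faithful$eruptions,
                    h = c(5, 0.5), n = 50)

# The result is a grid: x and y locations plus a matrix z of density values
str(dens)
dim(dens$z)
```

The contour lines drawn by the plot are level sets of a matrix like z.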

2D density plots (2) Continuing with the density plots from the last exercise, here you’ll explore the viridis package. This package contains multi-hue color palettes suitable for continuous variables.
The advantage of these scales is that instead of providing an even color gradient for a continuous scale, they highlight the highest values by using an uneven color gradient on purpose. The high values are lighter colors (yellow versus blue), so they stand out more.
A shaded 2D density plot showing the same data as the previous exercise has been provided for you. Up to you to upgrade it!
# Load in the viridis package
library(viridis)
package 'viridis' was built under R version 3.4.4
Loading required package: viridisLite
# Add viridis color scale
ggplot(faithful, aes(x = waiting, y = eruptions)) +
scale_y_continuous(limits = c(1, 5.5), expand = c(0,0)) +
scale_x_continuous(limits = c(40, 100), expand = c(0,0)) +
coord_fixed(60/4.5) +
stat_density_2d(geom = "tile", aes(fill = ..density..), h=c(5,.5), contour = FALSE)+ scale_fill_viridis()

Pair plots and correlation matrices Two useful quick-and-dirty pair-plot functions are pairs(), available in R on startup, and chart.Correlation() from the PerformanceAnalytics package. They both only take continuous variables.
You’ll be working with the iris dataset and with mtcars_fact, a version of mtcars where categorical variables have been converted into actual factor columns.
# pairs
pairs(iris[1:4])

# chart.Correlation
library(PerformanceAnalytics)
Loading required package: xts
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Attaching package: 'xts'
The following objects are masked from 'package:dplyr':
first, last
Package PerformanceAnalytics (1.5.2) loaded.
Copyright (c) 2004-2018 Peter Carl and Brian G. Peterson, GPL-2 | GPL-3
https://github.com/braverock/PerformanceAnalytics
Attaching package: 'PerformanceAnalytics'
The following object is masked from 'package:graphics':
legend
chart.Correlation(iris[1:4])

# ggpairs
library(GGally)
Attaching package: 'GGally'
The following object is masked from 'package:dplyr':
nasa
ggpairs(iris[1:3])

Create a correlation matrix in ggplot2 Instead of using an off-the-shelf correlation matrix function, you can of course create your own plot. Just for fun, in this exercise, you’ll re-create the scatterplot you see on the right. The strength of the correlation is depicted by the size and color of the points and labels.
For starters, a correlation matrix can be calculated using, for example, cor(dataframe) (if all variables are numerical). Before you can use your data frame to create your own correlation matrix plot, you’ll need to get it in the right format.
In the editor, you can see the definition of cor_list(), a function that re-formats the data frame x. Here, M holds the coefficients that are drawn as points in the lower triangle of the plot, and L holds the coefficients that are added as text in the upper triangle. With reshape2::melt(), the correlation matrices M and L are each converted into a three-column data frame: the x and y axes of the correlation matrix make up the first two columns and the corresponding correlation coefficient makes up the third column. These become the new variables “points” and “labels”, which can be mapped onto the size aesthetic for the points in the lower triangle and onto the label aesthetic for the text in the upper triangle, respectively. Their values will be the same, but their positions on the plot will be symmetrical about the diagonal! Merging M and L, you have everything you need.
If you’re not familiar with reshape2 - don’t worry, the only reason we use that instead of tidyr is that reshape2::melt() can handle a matrix, whereas tidyr::gather() requires a data frame. At this point you just need to understand how to use the output from cor_list().
You’ll first use dplyr to execute this function on the continuous variables in the iris data frame (the first four columns), but separately for each species. Please refer to the course on dplyr if you are not familiar with these functions.
Next, you’ll actually plot the resulting data frame with ggplot2 functions.
library(ggplot2)
library(reshape2)
cor_list <- function(x) {
L <- M <- cor(x)
M[lower.tri(M, diag = TRUE)] <- NA
M <- melt(M)
names(M)[3] <- "points"
L[upper.tri(L, diag = TRUE)] <- NA
L <- melt(L)
names(L)[3] <- "labels"
merge(M, L)
}
# Calculate xx with cor_list
library(dplyr)
xx <- iris %>%
group_by(Species) %>%
do(cor_list(.[1:4]))
# Finish the plot
ggplot(xx, aes(x = Var1, y = Var2)) +
geom_point(aes(col = points, size = abs(points)), shape = 16) +
geom_text(aes(col = labels, size = abs(labels), label = round(labels, 2))) +
scale_size(range = c(0, 6)) +
scale_color_gradient2("r", limits = c(-1, 1)) +
scale_y_discrete("", limits = rev(levels(xx$Var1))) +
scale_x_discrete("") +
guides(size = "none") +
geom_abline(slope = -1, intercept = nlevels(xx$Var1) + 1) +
coord_fixed() +
facet_grid(. ~ Species) +
theme(axis.text.y = element_text(angle = 45, hjust = 1),
axis.text.x = element_text(angle = 45, hjust = 1),
strip.background = element_blank())
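The three-column layout that melt() produces from a matrix can be previewed with base R alone. This sketch is an alternative to reshape2::melt(), not the method used in cor_list(); the column name r is an arbitrary choice.

```r
# Correlation matrix of the four continuous iris variables
cors <- cor(iris[1:4])

# One row per cell of the matrix: row name, column name, coefficient
long <- as.data.frame(as.table(cors), responseName = "r")

head(long, 3)
nrow(long)
```

A 4 x 4 matrix gives 16 rows, one for every pair of variables including each variable with itself.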

Proportional/stacked bar plots Before you head over to ternary plots, let’s try to make a classical proportional/stacked bar plot of a subset of the data. We’ll use a stacked bar plot and the coord_flip() function to flip the x and y axes.
The data frame for the African Soil Profiles Database is available in your workspace as africa and can be found in the GSIF package. It contains three columns: Sand, Silt and Clay. A smaller version, containing only 50 observations is stored in africa_sample.
In the first course we mentioned that in the data layer, the structure of the data should reflect how you wish to plot it. For a ternary plot, you need to have three separate variables, for example, Sand, Silt and Clay in africa. However, for a proportional/stacked bar plot, you just need two. The type should be defined as three levels within a single factor variable. That is, you want tidy data.
It’s also useful to maintain the site IDs as a variable within the data frame. Currently, they are stored as row names, which is poor style and not useful.
# Explore africa
str(africa)
'data.frame': 40093 obs. of 3 variables:
$ Sand: num 24 36 56 52 65 43 42 47 57 51 ...
$ Silt: num 12 14 18 21 3 14 22 19 15 14 ...
$ Clay: num 64 50 26 27 32 43 36 34 28 35 ...
africa_sample <- africa[sample(nrow(africa), 50), ]
str(africa_sample)
'data.frame': 50 obs. of 3 variables:
$ Sand: num 5 58 35 34 42 65 15 89 63 87 ...
$ Silt: num 17 26 28 10 14 8 43 4 3 6 ...
$ Clay: num 78 16 37 56 44 27 42 7 34 7 ...
# Add an ID column from the row.names
africa_sample$ID <- row.names(africa_sample)
# Gather africa_sample
library(tidyr)
Attaching package: 'tidyr'
The following object is masked from 'package:reshape2':
smiths
africa_sample_tidy <- gather(africa_sample, key, value, -ID)
head(africa_sample_tidy)
# Finish the ggplot command
ggplot(africa_sample_tidy, aes(x = factor(ID), y = value, fill = key)) +
geom_col() +
coord_flip()

Producing ternary plots Ok, let’s move onto ternary plots. For this you’ll use the ggtern package, which provides the ggtern() function.
In contrast to what you just saw in africa_sample_tidy, the three soil properties, Sand, Silt and Clay, are not going to be located in a single variable. The distinction between wide and tidy format data was discussed in the first course and here you’ll see it in action. Sometimes you need to rearrange your data for the desired plot type.
Here, you’ll use the complete dataset, africa, containing three separate variables for the measures of interest: that format is perfect for a ternary plot.
# Load ggtern
library(ggtern)
--
Consider donating at: http://ggtern.com
Even small amounts (say $10-50) are very much appreciated!
Remember to cite, run citation(package = 'ggtern') for further info.
--
Attaching package: 'ggtern'
The following objects are masked from 'package:ggplot2':
%+%, aes, annotate, calc_element, ggplot, ggplotGrob, ggplot_build, ggplot_gtable, ggsave,
layer_data, theme, theme_bw, theme_classic, theme_dark, theme_gray, theme_light, theme_linedraw,
theme_minimal, theme_void
# Build ternary plot
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
geom_point(shape=16, alpha=0.2)
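A ternary plot positions each observation by the relative shares of its three components. As a minimal sketch of that idea, the code below takes the first three rows shown in the str(africa) output above and converts them to row-wise shares with base R.

```r
# First three observations from the str(africa) output above
soil <- data.frame(Sand = c(24, 36, 56),
                   Silt = c(12, 14, 18),
                   Clay = c(64, 50, 26))

# Row-wise shares: each row is divided by its own total
shares <- prop.table(as.matrix(soil), margin = 1)

shares
rowSums(shares)
```

Every row sums to 1, so two of the three shares determine the third, which is why the data fit inside a triangle.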

Adjusting ternary plots Ternary plots have been around for a while in R; you could achieve the same thing with the vcd package authored by Michael Friendly. If you just need a quick and dirty ternary plot, that may suit you just fine. However, since ggtern is built on ggplot2, you can take advantage of all the tools available therein.
ggtern is authored by Nicholas Hamilton, more information can be found on his package website: www.ggtern.com.
The plot from the previous exercise is available twice. Can you adapt it in different ways to make different ternary density plots?
# ggtern and ggplot2 are loaded
# Original plot:
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
geom_point(shape = 16, alpha = 0.2)

# Plot 1
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
geom_density_tern()

# Plot 2
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
stat_density_tern(geom = "polygon", aes(fill = ..level.., alpha = ..level..)) +
guides(fill = "none")

Build the network (1) Network data may be stored in a variety of ways.
For this example, you’ll use an undirected network of romantic relationships in the TV show Mad Men: geomnet::madmen.
# Load geomnet & examine structure of madmen
library(geomnet)
str(madmen)
List of 2
$ edges :'data.frame': 39 obs. of 2 variables:
..$ Name1: Factor w/ 9 levels "Betty Draper",..: 1 1 2 2 2 2 2 2 2 2 ...
..$ Name2: Factor w/ 39 levels "Abe Drexler",..: 15 31 2 4 5 6 8 9 11 21 ...
$ vertices:'data.frame': 45 obs. of 2 variables:
..$ label : Factor w/ 45 levels "Abe Drexler",..: 5 9 16 23 26 32 33 38 39 17 ...
..$ Gender: Factor w/ 2 levels "female","male": 1 2 2 1 2 1 2 2 2 2 ...
# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
by.x = "Name1", by.y = "label",
all = TRUE)
# Examine structure of mmnet
str(mmnet)
'data.frame': 75 obs. of 3 variables:
$ Name1 : Factor w/ 45 levels "Betty Draper",..: 1 1 2 2 2 2 2 2 2 2 ...
$ Name2 : Factor w/ 39 levels "Abe Drexler",..: 15 31 2 4 5 6 8 9 11 21 ...
$ Gender: Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 2 ...
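The merged data frame has more rows than there are edges because all = TRUE also keeps vertices that never appear in Name1; those rows get NA in Name2. A small sketch on a hypothetical four-person network (the names and genders are made up) shows the effect:

```r
# Hypothetical network: three edges among four vertices
edges <- data.frame(Name1 = c("A", "A", "B"),
                    Name2 = c("B", "C", "C"))
vertices <- data.frame(label  = c("A", "B", "C", "D"),
                       Gender = c("female", "male", "female", "male"))

# Full outer join: edges gain Gender, unmatched vertices are kept too
net <- merge(edges, vertices,
             by.x = "Name1", by.y = "label",
             all = TRUE)

net
```

C and D never start an edge, so each contributes one extra row with a missing Name2, giving five rows in total.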
Build the network (2) Now that your data is in the correct format, you can build the actual network plot.
You’ll use the geom_net() function, a ggplot layer that’s in the geomnet package. The ggnetwork package is a popular alternative, but we will not discuss that here.
Can you finish the ggplot() command?
# geomnet is pre-loaded
# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
by.x = "Name1", by.y = "label",
all = TRUE)
# Finish the ggplot command
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
geom_net(aes(col=Gender), size=6, linewidth=1, labelon=TRUE, fontsize=3, labelcolour="black")

Adjusting the network Let’s clean up the network a bit. As you can see, since this is in the ggplot2 framework, you can manually adjust the scales like you have always done.
Here you’re going to use another trick to remove all theme elements and make a clean network plot.
# geomnet is pre-loaded
library(ggmap)
Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation('ggmap') for details.
# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
by.x = "Name1", by.y = "label",
all = TRUE)
# Tweak the network plot
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
geom_net(aes(col = Gender),
size = 6,
linewidth = 1,
labelon = TRUE,
fontsize = 3,
labelcolour = "black",
directed = TRUE) +
scale_color_manual(values = c("#FF69B4", "#0099ff")) +
xlim(c(-0.05, 1.05)) +
ggmap::theme_nothing(legend = TRUE) +
theme(legend.key = element_blank())
`panel.margin` is deprecated. Please use `panel.spacing` property instead

Autoplot on linear models R has several plotting methods for specific objects. For example, using plot() on the result of an lm() call produces four plots that give you insight into how well the fitted model describes the data.
The ggfortify package is an all-purpose plot converter between base graphics and ggplot2 grid graphics.
You’ll explore exactly what we mean by graphics and grid in chapter 4. For now, just know that if you want to use the automatic output features in the context of ggplot2, they must first be converted to a ggplot object via ggfortify. This can be important at the superficial level, for consistency in appearance, but also at a deeper level, for later combining several plots in a single graphics device.
# Create linear model: res
res <- lm(Volume~Girth, data = trees)
# Plot res
plot(res)




# Import ggfortify and use autoplot()
library(ggfortify)
package 'ggfortify' was built under R version 3.4.4
autoplot(res, ncol=2)
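The diagnostic plots are built from quantities stored in the model object. As a small sketch, the values behind a residuals-versus-fitted panel can be pulled out with base R accessor functions; the data frame name diag_data is an arbitrary choice.

```r
# Same model as above
res <- lm(Volume ~ Girth, data = trees)

# Fitted values and residuals, one row per tree
diag_data <- data.frame(fitted   = fitted(res),
                        residual = resid(res))

head(diag_data, 3)
```

Having these values in a data frame makes it possible to build a custom diagnostic plot when the automatic output does not suit.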

ggfortify - time series Time series objects (class mts or ts) also have their own methods for plot(). ggfortify can also take advantage of this functionality.
In the workspace, you’ll find the variable Canada (it comes from the vars package): an mts class object with four series: prod is a measure of labour productivity, e is employment, U is the unemployment rate, and rw the real wage. They are each plotted as separate series by default.
# ggfortify and Canada are available
library(vars)
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked _by_ '.GlobalEnv':
mammals
The following object is masked from 'package:dplyr':
select
Loading required package: strucchange
Loading required package: sandwich
Loading required package: urca
Loading required package: lmtest
package 'lmtest' was built under R version 3.4.4
# Inspect structure of Canada
str(Canada)
Time-Series [1:84, 1:4] from 1980 to 2001: 930 930 930 931 933 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "e" "prod" "rw" "U"
# Call plot() on Canada
plot(Canada)

# Call autoplot() on Canada
autoplot(Canada)

Distance matrices and Multi-Dimensional Scaling (MDS) As you can probably imagine, distance matrices (class dist) contain the measured distance between all pair-wise combinations of many points. For example, the eurodist dataset contains the distances between major European cities. dist objects lend themselves well to autoplot().
The cmdscale() function from the stats package performs Classical Multi-Dimensional Scaling and returns point coordinates as a matrix. Although autoplot() will work on this object, it will produce a heatmap, and not a scatter plot. However, if either eig = TRUE, add = TRUE or x.ret = TRUE is specified, cmdscale() will return a list instead of a matrix. In these cases, the list method for autoplot() in the ggfortify package can deal with the output. The specifics of multi-dimensional scaling are beyond the scope of this course; however, details on the method and these arguments can be found in the help page ?cmdscale.
# ggfortify and eurodist are available
# Autoplot + ggplot2 tweaking
autoplot(eurodist) +
coord_fixed()
Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.

# Autoplot of MDS
autoplot(cmdscale(eurodist, eig = TRUE),
label = TRUE,
label.size = 3,
size = 0)
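The change in return type described above is easy to confirm in the console. This short check uses only the stats and datasets packages:

```r
# Default call: a plain matrix of coordinates, one row per city
mds_matrix <- cmdscale(eurodist)
class(mds_matrix)
dim(mds_matrix)

# With eig = TRUE: a list whose points element holds the same coordinates
mds_list <- cmdscale(eurodist, eig = TRUE)
class(mds_list)
names(mds_list)
```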

Plotting K-means clustering ggfortify also supports stats::kmeans class objects. You must explicitly pass the original data to the autoplot function via the data argument, since kmeans objects don’t contain the original data. The result will be automatically colored according to cluster.
Here, you’ll use the iris dataset and just look at K-means clustering, although this also works with other methods, including the clustering functions cluster::clara(), cluster::fanny() and cluster::pam(), as well as principal component analysis with stats::prcomp(). Unfortunately, a discussion of these methods is beyond the scope of this course.
# Perform clustering
iris_k <- kmeans(iris[-5], 3)
# Autoplot: color according to cluster
autoplot(iris_k, data = iris, frame = TRUE)

# Autoplot: above, plus shape according to species
autoplot(iris_k, data = iris, frame = TRUE, shape='Species')
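To see why the data argument is needed, it helps to look inside a kmeans object. In the sketch below the seed is an arbitrary choice made only so the result is repeatable, and with_clusters is a hypothetical name.

```r
set.seed(1)  # arbitrary seed, only for repeatability
iris_k <- kmeans(iris[-5], 3)

# The object stores assignments and centers, but not the observations
names(iris_k)

# Attaching the assignments to the original rows by hand
with_clusters <- cbind(iris, cluster = factor(iris_k$cluster))
table(with_clusters$Species, with_clusters$cluster)
```

The cluster numbers themselves are arbitrary labels, so only the grouping, not the numbering, should be compared with Species.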

Working with maps from the maps package: USA The easiest way to obtain map polygons is through the maps package. Unfortunately there are only a few locations available, but if your region of interest is included they are extremely convenient.
The available maps of political boundaries are:
Global: world, world2
Country: france, italy, nz, usa
USA: county, state
The maps can be accessed via ggplot2::map_data(), which converts the map into a data frame containing the variables long and lat. To draw the map, you need to use geom_polygon() which will connect the points of latitude and longitude for you.
library(maps)
package 'maps' was built under R version 3.4.4
# maps, ggplot2, and ggmap are pre-loaded
# Use map_data() to create usa and inspect
usa <- map_data("usa")
str(usa)
'data.frame': 7243 obs. of 6 variables:
$ long : num -101 -101 -101 -101 -101 ...
$ lat : num 29.7 29.7 29.7 29.6 29.6 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "main" "main" "main" "main" ...
$ subregion: chr NA NA NA NA ...
# Build the map
ggplot(usa, aes(x = long, y = lat, group = group)) +
geom_polygon() +
coord_map() +
theme_nothing()
`panel.margin` is deprecated. Please use `panel.spacing` property instead

The population pyramid Animations are particularly useful for temporal or geospatial data, and they are surprisingly easy to make! Here, you simply loop over the time variable in your dataset, composing a new plot for each subset of the data. These individual images are then combined into an animated GIF file.
To show this you’ll use a great animated population pyramid that was presented on the Revolutions blog. There are many more adjustments you could have made to the plot, but we’ll just make a barebones version here.
japan <- read.table("japanPOP.txt", header=TRUE)
head(japan)
# Inspect structure of japan
str(japan)
'data.frame': 8282 obs. of 4 variables:
$ AGE : int 0 1 2 3 4 5 6 7 8 9 ...
$ POP : int -572954 -581748 -585239 -582223 -568788 -571899 -590530 -602349 -612527 -620373 ...
$ time: int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
$ SEX : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
library(animation)
# Finish the code inside saveGIF
saveGIF({
# Loop through all time points
for (i in unique(japan$time)) {
# Subset japan: data
data <- subset(japan, time == i)
# Finish the ggplot command
p <- ggplot(data, aes(x = AGE, y = POP, fill = SEX, width = 1)) +
coord_flip() +
geom_bar(data = data[data$SEX == "Female",], stat = "identity") +
geom_bar(data = data[data$SEX == "Male",], stat = "identity") +
ggtitle(i)
print(p)
}
}, movie.name = "pyramid.gif", interval = 0.1)
sh: convert: command not found
Error in cmd.fun(sprintf("%s --version", convert), intern = TRUE) :
error in running command
[1] FALSE
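
The error above suggests that saveGIF() could not find the external convert command (saveGIF() relies on ImageMagick or GraphicsMagick to assemble the frames). The sketch below, in base R only, checks for that command up front and shows the per-time-point subsetting that the loop performs. The toy_pop data and the frames list are hypothetical stand-ins, not part of the exercise.

```r
# An empty string from Sys.which() means the command is not on the PATH
convert_path <- Sys.which("convert")
has_convert <- nzchar(convert_path)

# Toy stand-in for the japan data: two time points, two ages, both sexes
toy_pop <- data.frame(
  AGE  = rep(c(0, 1), times = 4),
  POP  = c(-5, -6, 5, 6, -4, -5, 4, 5),
  time = rep(c(2010, 2011), each = 4),
  SEX  = rep(rep(c("Male", "Female"), each = 2), times = 2)
)

# One subset per time point, in the order the frames would be drawn
frames <- lapply(unique(toy_pop$time), function(i) subset(toy_pop, time == i))
length(frames)
```

If has_convert is FALSE, installing one of those tools (or writing the frames out as individual image files instead) is one way around the error.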
Arranging plots (1) The functions in gridExtra allow you to arrange any number of plots in a variety of ways. Since you can access the legend as a separate object, that means you can also arrange multiple plots with a single legend, as shown in the viewer. This is a good alternative to faceting, since with facets it’s not possible to set a different geom for each sub-plot. Here, you can combine any variety of plots and use a consistent color scale with only one legend to unify the whole image.
To do this you’ll create a new arranged graphical object using grid.arrange(), which combines several pre-existing grobs. Just like grid.rect() and rectGrob(), there are two versions of the arrange function: grid.arrange() produces graphics output, which means it draws the item straight to the viewer, whereas arrangeGrob() returns a graphical object (a grob) that can be manipulated further.
In this exercise, you’ll just create your objects and arrange them using grid.arrange(). In the first steps you created two basic plots, g1 and g2. In the next exercise you’ll see what to do about the legend.
# Add a theme (legend at the bottom)
g1 <- ggplot(mtcars, aes(wt, mpg, col = cyl)) +
geom_point(alpha = 0.5) +
theme(legend.position = "bottom")
# Add a theme (no legend)
g2 <- ggplot(mtcars, aes(disp, fill = cyl)) +
geom_histogram(position = "identity", alpha = 0.5, binwidth = 20) +
theme(legend.position = "none")
# Load gridExtra
library(gridExtra)
Attaching package: 'gridExtra'
The following objects are masked from 'package:ggtern':
arrangeGrob, grid.arrange
The following object is masked from 'package:dplyr':
combine
# Call grid.arrange()
grid.arrange(g1, g2, ncol = 2)
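
The draw-versus-return distinction can be sketched with the grid pair mentioned above, rectGrob() and grid.draw(), without needing gridExtra. The fill colors and the name r_blue are illustrative assumptions.

```r
library(grid)

# rectGrob() only builds the object; nothing is drawn yet
r <- rectGrob(gp = gpar(fill = "grey90"))
class(r)

# Because it is an object, it can be modified before drawing
r_blue <- editGrob(r, gp = gpar(fill = "blue"))

# Draw on a throwaway device so the sketch also runs non-interactively
pdf(NULL)
grid.newpage()
grid.draw(r_blue)
invisible(dev.off())
```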

Arranging plots (2) In the previous exercise you did a bare-bones arrangement of plots, but it would be nicer if the plot looks like the one that’s shown in the viewer. You can imagine that you have three panels, not two. There are two asymmetrical rows, the small second row is where the legend is, and two symmetrical columns, where the plots are.
To obtain this plot you need to extract the legend. You already saw this in previous exercises and it has already been done for you; the legend is available as my_legend. Next you need to arrange all the items appropriately.
# ggplot2, grid and gridExtra have been loaded for you
# Definitions of g1 and g2
library("grid")
g1 <- ggplot(mtcars, aes(wt, mpg, col = cyl)) +
geom_point() +
theme(legend.position = "bottom")
g2 <- ggplot(mtcars, aes(disp, fill = cyl)) +
geom_histogram(binwidth = 20) +
theme(legend.position = "none")
legend_index <- 15
# Extract the legend from g1
my_legend <- ggplotGrob(g1)$grobs[[legend_index]]
# Create g1_noleg
g1_noleg <- g1 +
theme(legend.position = "none")
# Calculate the height: legend_height
legend_height <- sum(my_legend$heights)
# Arrange g1_noleg, g2 and my_legend
grid.arrange(g1_noleg, g2, my_legend,
layout_matrix = matrix(c(1, 3, 2, 3), ncol = 2),
heights = unit.c(unit(1, "npc") - legend_height, legend_height))
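
To see why this layout_matrix gives two plots on top and a full-width legend below, it can help to print the matrix on its own. This is only a base R illustration of the argument value used above.

```r
# Same matrix as passed to layout_matrix; R fills matrices column by column
layout <- matrix(c(1, 3, 2, 3), ncol = 2)
layout

# Row 1 holds items 1 and 2 side by side; row 2 holds item 3 in both columns,
# which is why the legend spans the full width
```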


Base package bag plot Before you create your own stats layer, you’ll begin by understanding what a bag plot is, and how to get the data for your own plots.
For this you’ll use a fake dataset called test_data, which only contains two variables. A scatter plot is shown in the viewer.
The aplpack package, which contains the bagplot() and compute.bagplot() functions, has been loaded for you.
library(aplpack)
x = floor(runif(60, 1700, 3700 ))
y = floor(runif(60, 50, 330))
test_data <- data.frame("x" = x, "y" = y)
test_data
# Call bagplot() on test_data
bagplot(test_data)
# Call compute.bagplot on test_data, assign to bag
bag <- compute.bagplot(test_data)
# Display information
bag$hull.loop
x y
3673 176
3360 98
2808 66
2516 59
1705 70
1745 259
1773 307
1806 320
2698 307
3539 288
3591 283
3647 232
bag$hull.bag
[,1] [,2]
3448.683 201.23791
3435.747 198.10602
3341.618 175.55902
3119.893 139.79230
3007.928 125.57880
2890.010 110.68831
2727.063 96.41987
2596.434 85.44424
2532.538 81.69941
2484.876 78.91393
2390.600 81.80400
2342.252 86.06236
2291.073 94.43000
2285.795 95.37164
2055.984 136.86930
2031.283 142.80128
sl 1831.228 237.06798
1830.284 239.98771
1833.782 246.91343
1880.925 276.84057
1920.734 283.47928
1946.224 285.44842
2161.923 291.78007
2314.226 296.18919
2454.514 300.14002
2455.218 300.15760
2563.785 300.30527
2634.528 299.70904
2848.287 296.71008
2866.720 296.27750
2891.040 295.68152
2914.951 294.27776
2931.411 293.26476
3009.666 283.97861
sr 3368.374 238.53977
sr 3385.561 234.81173
sr 3473.191 209.94078
bag$pxy.outlier
NULL
# Highlight components
points(bag$hull.loop, col = "green", pch = 16)
points(bag$hull.bag, col = "orange", pch = 16)
points(bag$pxy.outlier, col = "purple", pch = 16)

Multilayer ggplot2 bag plot The viewer shows the plot you created in the last exercise.
With our current understanding, if we wanted to make a bag plot in ggplot2, we’d take the three data frames (for the loop, bag and outliers) and add them using three separate geom layers.
Let’s see how this simple solution works and in the next exercises you’ll expand on this topic to make a real stats layer.
The bag and test_data objects from the previous exercise are provided. test_data contains two variables: x and y.
# bag and test_data are available
# Create data frames from matrices
hull.loop <- data.frame(x = bag$hull.loop[,1], y = bag$hull.loop[,2])
hull.bag <- data.frame(x = bag$hull.bag[,1], y = bag$hull.bag[,2])
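# bag$pxy.outlier is NULL when there are no outliers (as in the output above), so fall back to an empty data frame
pxy.outlier <- if (is.null(bag$pxy.outlier)) data.frame(x = numeric(0), y = numeric(0)) else data.frame(x = bag$pxy.outlier[,1], y = bag$pxy.outlier[,2])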
# Finish the ggplot command
ggplot(test_data, aes(x = x, y = y)) +
geom_polygon(data = hull.loop, fill = "green") +
geom_polygon(data = hull.bag, fill = "orange") +
geom_point(data = pxy.outlier, col = "purple", pch = 16, cex = 1.5)

Creating ggproto functions Now that you know where to find the statistics and how to use them in ggplot2, let’s put them into the functions that will make them easier to use.
For this you’ll use the ggproto object-oriented programming system - the basis of creating a new layer in ggplot2. There are four arguments for a ggproto object. The first two arguments are its name and what it inherits from (in this case Stat). Next come the required aesthetics, and then, most importantly, what the stat should do. For each group of data it receives from the data layer, what should be computed? This will simply be the calculations you performed in the previous exercise.
The ggproto object definition of StatLoop is already provided. Can you finish the implementations for the other ones?
# ggproto for StatLoop (hull.loop)
StatLoop <- ggproto("StatLoop", Stat,
required_aes = c("x", "y"),
compute_group = function(data, scales) {
bag <- compute.bagplot(x = data$x, y = data$y)
data.frame(x = bag$hull.loop[,1], y = bag$hull.loop[,2])
})
# ggproto for StatBag (hull.bag)
StatBag <- ggproto("StatBag", Stat,
required_aes = c("x", "y"),
compute_group = function(data, scales) {
bag <- compute.bagplot(x = data$x, y = data$y)
data.frame(x = bag$hull.bag[,1], y = bag$hull.bag[,2])
})
# ggproto for StatOut (pxy.outlier)
StatOut <- ggproto("StatOut", Stat,
required_aes = c("x", "y"),
compute_group = function(data, scales) {
bag <- compute.bagplot(x = data$x, y = data$y)
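# pxy.outlier is NULL when there are no outliers, so return an empty data frame in that case
if (is.null(bag$pxy.outlier)) return(data.frame(x = numeric(0), y = numeric(0)))
data.frame(x = bag$pxy.outlier[,1], y = bag$pxy.outlier[,2])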
})
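
As a self-contained illustration of the same ggproto pattern, the sketch below defines a stat that keeps only the convex hull of each group. It swaps the bag plot computation for grDevices::chull() so that it runs without aplpack. The names StatChull and stat_chull are illustrative and are not part of the exercise.

```r
library(ggplot2)

# Name, parent, required aesthetics, and the per-group computation
StatChull <- ggproto("StatChull", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales) {
    # Keep only the rows that lie on the convex hull of this group
    data[chull(data$x, data$y), , drop = FALSE]
  }
)

# Layer function that wires the stat to a geom
stat_chull <- function(mapping = NULL, data = NULL, geom = "polygon",
                       position = "identity", na.rm = FALSE, show.legend = NA,
                       inherit.aes = TRUE, ...) {
  layer(
    stat = StatChull, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_chull(fill = NA, colour = "black")
p
```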
Creating stat_bag() In the previous exercise you established three ggproto objects. Now you need to combine them under a new ggplot2 function that you’ll call stat_bag().
Adding a stat_bag() layer will execute each of the three ggproto objects that you just created.
Your three objects are called StatLoop, StatBag, StatOut, so you’ll need three layers in your stat_bag() function, which you’ll make with the layer() function. When you have multiple layers, you can combine them in a list by simply calling list().
For each layer, you’ll also need to specify the appropriate geom: “polygon” or “point”.
The framework for the stat_bag() layer function has been provided for you.
# StatLoop, StatBag and StatOut are available
# Combine ggproto objects in layers to build stat_bag()
stat_bag <- function(mapping = NULL, data = NULL, geom = "polygon",
position = "identity", na.rm = FALSE, show.legend = NA,
inherit.aes = TRUE, loop = FALSE, ...) {
list(
# StatLoop layer
layer(
stat = StatLoop, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, alpha = 0.35, col = NA, ...)
),
# StatBag layer
layer(
stat = StatBag, data = data, mapping = mapping, geom = geom,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, alpha = 0.35, col = NA, ...)
),
# StatOut layer
layer(
stat = StatOut, data = data, mapping = mapping, geom = "point",
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(na.rm = na.rm, alpha = 0.7, col = NA, shape = 21, ...)
)
)
}
Use stat_bag() So far you’ve seen the basics for creating a new ggplot layer. It’s bare-bones, but functional. You now have a working solution to the bag plot question.
The ggplot2 command that you’ve coded before is available. Now, let’s use stat_bag() to make our plot!
# hull.loop, hull.bag and pxy.outlier are available
# stat_bag, test_data and test_data2 are available
# Previous method
ggplot(test_data, aes(x = x, y = y)) +
geom_polygon(data = hull.loop, fill = "green") +
geom_polygon(data = hull.bag, fill = "orange") +
geom_point(data = pxy.outlier, col = "purple", pch = 16, cex = 1.5)

# stat_bag
ggplot(test_data, aes(x = x, y = y)) +
stat_bag(fill = 'black')

Viewport basics (1) To get familiar with grid graphics, you’ll begin with using some grid. functions. The grid package is already loaded into your R session, so you can get started straight away!
Note: In DataCamp’s learning interface, each change you make to the plot will appear as a new plot, so you can see the effect of each command.
# Draw rectangle in null viewport
grid.rect(gp = gpar(fill = "grey90"))
# Write text in null viewport
grid.text("null viewport")
# Draw a line
grid.lines(x = c(0, 0.75), y = c(0.25, 1),
gp = gpar(lty = 2, col = "red"))
Viewport basics (2) The code from the previous exercise that populates the null viewport with some basic shapes is already available. Let’s take the next step and start manipulating the stack of viewports.
# Populate null viewport
grid.rect(gp = gpar(fill = "grey90"))
grid.text("null viewport")
grid.lines(x = c(0,0.75), y = c(0.25, 1),
gp = gpar(lty = 2, col = "red"))
# Create new viewport: vp
vp <- viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.5, just = "center")
# Push vp
pushViewport(vp)
# Populate new viewport with rectangle
grid.rect(gp = gpar(fill = "blue"))
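
A small sketch of how the viewport stack changes as you push and navigate, assuming a viewport named "inner" (the name is an illustrative choice). It records the name of the current viewport at each step rather than drawing anything.

```r
library(grid)

pdf(NULL)        # throwaway device so the sketch also runs non-interactively
grid.newpage()

# Before pushing anything, drawing happens in the top-level viewport
name_start <- current.viewport()$name

# Push a named viewport; it becomes the current drawing context
pushViewport(viewport(width = 0.5, height = 0.5, name = "inner"))
name_inside <- current.viewport()$name

# Navigate back up one level without removing the viewport from the tree
upViewport()
name_after <- current.viewport()$name

c(name_start, name_inside, name_after)
invisible(dev.off())
```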
Build a plot from scratch (1) Using the viewports, you can create plots, manipulating the space as needed.
In this exercise you’ll establish your grid viewport and in the following exercise you’ll populate it with values.
# 1 - Create plot viewport: pvp
mar <- c(5, 4, 2, 2)
pvp <- plotViewport(mar)
# 2 - Push pvp
pushViewport(pvp)
# 3 - Add rectangle
grid.rect(gp = gpar(fill = "grey80"))
# Create data viewport: dvp
dvp <- dataViewport(xData = mtcars$wt, yData = mtcars$mpg)
# 4 - Push dvp
pushViewport(dvp)
# Add two axes
grid.xaxis()
grid.yaxis()
Build a plot from scratch (2) The work you did before to build a plot from scratch is already included. Now you’re ready to add the points and the appropriate labels.
# Work from before
pushViewport(plotViewport(c(5, 4, 2, 2)))
grid.rect(gp = gpar())
pushViewport(dataViewport(xData = mtcars$wt, yData = mtcars$mpg))
grid.xaxis()
grid.yaxis()
# 1 - Add text to x axis
grid.text("Weight", y = unit(-3, "lines"))
# 2 - Add text to y axis
grid.text("MPG", x = unit(-3, "lines"), rot = 90)
# 3 - Add points
grid.points(x = mtcars$wt, y = mtcars$mpg, pch = 16)
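
To see what dataViewport() contributes, the sketch below inspects the native scales it derives from the data. It only builds the viewport object and prints its scales next to the raw data ranges.

```r
library(grid)

dvp <- dataViewport(xData = mtcars$wt, yData = mtcars$mpg)

# Native scales derived from the data; they extend slightly past the data range
dvp$xscale
range(mtcars$wt)
dvp$yscale
range(mtcars$mpg)
```

Because grid.points() and the axes are drawn in these native units, the points land inside the plotting region with a small margin.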
Modifying a plot with grid.edit The commands you’ve coded up to now to create the plot are available in the editor. The great thing about grid, in comparison to base, is that you can name the different plot elements, so that you can access them and change them later on. You can do this with the grid.edit() function. Give it a try!
# Work from before
pushViewport(plotViewport(c(5, 4, 2, 2)))
grid.rect(gp = gpar())
pushViewport(dataViewport(xData = mtcars$wt, yData = mtcars$mpg))
grid.xaxis()
grid.yaxis()
# Work from before - add names
grid.text("Weight", y = unit(-3, "lines"), name = "xaxis")
grid.text("MPG", x = unit(-3, "lines"), rot = 90, name = "yaxis")
grid.points(x = mtcars$wt, y = mtcars$mpg, pch = 16, name = "datapoints")
# Edit "xaxis"
grid.edit("xaxis", label = "Weight (1000 lbs)")


# Edit "yaxis"
grid.edit("yaxis", label = "Miles/(US) gallon")

# Edit "datapoints"
grid.edit("datapoints", gp = (gpar(col = "#C3212766", cex = 2)))
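
The same name-based editing also works on grob objects that have not been drawn, using editGrob() and getGrob() instead of grid.edit(). The sketch below is an assumption-laden miniature: it bundles two named text grobs into a gTree called labels (a hypothetical name) and edits one of them by name.

```r
library(grid)

# Two named text grobs collected in one tree
labels <- gTree(children = gList(
  textGrob("Weight", y = unit(-3, "lines"), name = "xaxis"),
  textGrob("MPG", x = unit(-3, "lines"), rot = 90, name = "yaxis")
))

# Edit one child by name; the other is left untouched
edited <- editGrob(labels, gPath("xaxis"), label = "Weight (1000 lbs)")

getGrob(edited, "xaxis")$label
getGrob(edited, "yaxis")$label
```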

Exploring the gTable In the previous chapter you saw graphical outputs using a variety of grid. functions. Graphical Objects, aka Grobs, are the object form of these items and can be found in your ggplot2 plots. Let’s take a look at how these grobs are stored in ggplot objects.
To start, a simple plot, p, has been coded for you.
# A simple plot p
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()
# Create gtab with ggplotGrob()
gtab <- ggplotGrob(p)
# Print out gtab
gtab
TableGrob (10 x 9) "layout": 18 grobs
z cells name grob
1 0 ( 1-10, 1- 9) background rect[plot.background..rect.8067]
2 5 ( 5- 5, 3- 3) spacer zeroGrob[NULL]
3 7 ( 6- 6, 3- 3) axis-l absoluteGrob[GRID.absoluteGrob.8043]
4 3 ( 7- 7, 3- 3) spacer zeroGrob[NULL]
5 6 ( 5- 5, 4- 4) axis-t zeroGrob[NULL]
6 1 ( 6- 6, 4- 4) panel gTree[panel-1.gTree.8023]
7 9 ( 7- 7, 4- 4) axis-b absoluteGrob[GRID.absoluteGrob.8036]
8 4 ( 5- 5, 5- 5) spacer zeroGrob[NULL]
9 8 ( 6- 6, 5- 5) axis-r zeroGrob[NULL]
10 2 ( 7- 7, 5- 5) spacer zeroGrob[NULL]
11 10 ( 4- 4, 4- 4) xlab-t zeroGrob[NULL]
12 11 ( 8- 8, 4- 4) xlab-b titleGrob[axis.title.x..titleGrob.8026]
13 12 ( 6- 6, 2- 2) ylab-l titleGrob[axis.title.y..titleGrob.8029]
14 13 ( 6- 6, 6- 6) ylab-r zeroGrob[NULL]
15 14 ( 6- 6, 8- 8) guide-box gtable[guide-box]
16 15 ( 3- 3, 4- 4) subtitle zeroGrob[plot.subtitle..zeroGrob.8064]
17 16 ( 2- 2, 4- 4) title zeroGrob[plot.title..zeroGrob.8063]
18 17 ( 9- 9, 4- 4) caption zeroGrob[plot.caption..zeroGrob.8065]
# Extract the grobs from gtab: g
g <- gtab$grobs
# Draw only the legend
legend_index <- which(vapply(g, inherits, what = "gtable", logical(1)))
grid.draw(g[[legend_index]])
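
An alternative way to locate the legend, sketched here as an assumption rather than the course's method, is to search the layout names instead of relying on a position or a class test. The exact names depend on the installed ggplot2 version (the table above shows a single guide-box entry), so the pattern only matches the prefix.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()
gtab <- ggplotGrob(p)

# Rows of the layout table whose name starts with "guide-box"
legend_rows <- grep("^guide-box", gtab$layout$name)
gtab$layout$name[legend_rows]
```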
Modifying the gTable You can visualize the layout of a gTable object with gtable_show_layout(). In the layout plot, each segment is labelled with its position.
The legend, which you can access with g[[legend_index]], is a gTable itself, so you can also show its layout. It’s perfectly possible to update this layout by adding new graphical objects, similar to what you saw in the video.
library(gtable)
# Code from before
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()
gtab <- ggplotGrob(p)
g <- gtab$grobs
legend_index <- which(vapply(g, inherits, what = "gtable", logical(1)))
grid.draw(g[[legend_index]])
# 1 - Show layout of legend grob
gtable_show_layout(g[[legend_index]])

# Create text grob
my_text <- textGrob(label = "Motor Trend, 1974", gp = gpar(fontsize = 7, col = "gray25"))
# 2 - Use gtable_add_grob to modify original gtab
new_legend <- gtable_add_grob(gtab$grobs[[legend_index]], my_text, 3, 2)
# 3 - Update in gtab
gtab$grobs[[legend_index]] <- new_legend
# 4 - Draw gtab
grid.draw(gtab)

Exploring ggplot objects ggplot objects are basically just a named list that contains the information to make the actual plot. Here you’ll explore the structure of this object.
# Simple plot p
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()
# Examine class() and names()
class(p)
[1] "gg" "ggplot"
names(p)
[1] "data" "layers" "scales" "mapping" "theme" "coordinates" "facet" "plot_env"
[9] "labels"
# Print the scales sub-list
p$scales$scales
list()
# Update p
p <- p +
scale_x_continuous("Length", limits = c(4, 8), expand = c(0, 0)) +
scale_y_continuous("Width", limits = c(2, 4.5), expand = c(0, 0))
# Print the scales sub-list
p$scales$scales
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 4 -- 8
[[2]]
<ScaleContinuousPosition>
Range:
Limits: 2 -- 4.5
ggplot_build and ggplot_gtable In the viewer we have produced a box plot of the mtcars dataset (called p) that you’ll use to explore two key ggplot functions for accessing the object internals: ggplot_build() and ggplot_gtable().
ggplot_build() is executed when you want to display or save an actual ggplot plot. It takes the data input and produces the visual output.
# Box plot of mtcars: p
p <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) + geom_boxplot()
# Create pbuild
pbuild <- ggplot_build(p)
# a list of 3 elements
names(pbuild)
[1] "data" "layout" "plot"
# Print out each element in pbuild
pbuild$data
[[1]]
pbuild$layout
pbuild$plot
# Create gtab from pbuild
gtab <- ggplot_gtable(pbuild)
# Draw gtab
grid.draw(gtab)

Extracting details In the video you saw how to change the clipping parameters of a gTable object. Here, you’ll see something more practical: how to extract calculated values.
Many geoms are associated with underlying descriptive statistics, which are calculated and then plotted. In these cases you don’t directly have the values that were plotted. These values are stored under the hood, however, and you can access them in the result of ggplot_build(). This can be particularly useful for box plots. For example, since there are many methods for calculating Q1 and Q3, if you calculate your IQR and outliers outside of ggplot2 you may end up using a different method and get different results. Sometimes you want to have exactly the values that were plotted.
# Box plot of mtcars: p
p <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) + geom_boxplot()
# Build pdata
pdata <- ggplot_build(p)$data
# confirm that the first element of the list is a data frame
class(pdata[[1]])
[1] "data.frame"
# Isolate this data frame
my_df <- pdata[[1]]
# The x labels
my_df$group <- c("4", "6", "8")
# Print out specific variables
my_df[c(1:6, 11)]
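
To make the point about differing quartile methods concrete, the base R sketch below computes Q1 and Q3 for the four-cylinder weights in three ways. Which definition a given plotting function uses is not asserted here; the comparison only shows that the choice matters.

```r
x <- mtcars$wt[mtcars$cyl == 4]

q_type7 <- quantile(x, c(0.25, 0.75), type = 7)  # the default of quantile()
q_type6 <- quantile(x, c(0.25, 0.75), type = 6)  # an alternative definition
hinges  <- fivenum(x)[c(2, 4)]                   # Tukey's hinges

rbind(type7 = q_type7, type6 = q_type6, hinges = hinges)
```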
---
title: "Data Visualization with ggplot 3"
output: html_notebook
---

Refresher (1)
As a refresher to statistical plots, let's build a scatter plot with an additional statistic layer.

A dataset called movies_small is coded in your workspace. It is a random sample of 1000 observations from the larger movies dataset, that's inside the ggplot2movies package. The dataset contains information on movies from IMDB. The variable votes is the number of IMDB users who have rated a movie and the rating (converted into a categorical variable) is the average rating for the movie.

```{r}
# Create movies_small
library(ggplot2movies)
library(ggplot2)
set.seed(123)
movies_small <- movies[sample(nrow(movies), 1000), ]
movies_small$rating <- factor(round(movies_small$rating))

# Explore movies_small with str()
str(movies_small)

# Build a scatter plot with mean and 95% CI
ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red") +
  scale_y_log10()
```

Refresher (2)
The plot in the graphics device is a variation on an oft-seen ggplot2 example using the diamonds dataset (containing information on several variables of over 50,000 diamonds).

Recall that there are a variety of scale_ functions. With these, data are transformed or filtered first, after which the plot and associated statistics are computed. For example, scale_y_continuous(limits = c(100, 1000)) will remove values outside that range.

Contrast this to coord_cartesian(), which computes the statistics before zooming. That means that the plot and summary statistics are computed on the full raw data. That's why we say that coord_cartesian(ylim = c(100, 1000)) "zooms in" on a plot. This was discussed in the chapter on coordinates in course 2.

Here we're going to expand on this and introduce scale_x_log10() and scale_y_log10() which perform log10 transformations, and coord_equal(), which sets an aspect ratio of 1 (coord_fixed() is also an option).

Your task is to reproduce the plot in the viewer. Before you do this, it might be a good idea to explore diamonds in the console if you are not familiar with it.

```{r}
# Reproduce the plot
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
  geom_point(alpha = 0.5, size = 0.5, shape = 16) +
  scale_x_log10(expression(log[10](Carat)), limits = c(0.1,10)) +
  scale_y_log10(expression(log[10](Price)), limits = c(100,100000)) +
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic()
```
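
The difference between scale limits and coord_cartesian() described above can be mimicked in base R with a toy vector. The price values are made up for illustration, and the mean stands in for whatever statistic a layer computes.

```r
price <- c(50, 200, 500, 900, 5000)

# Scale limits: out-of-range values are dropped before the statistic is computed
in_range <- price >= 100 & price <= 1000
mean_scale_limits <- mean(price[in_range])

# coord_cartesian(): the statistic uses every value; only the view is cropped
mean_coord_zoom <- mean(price)

c(scale_limits = mean_scale_limits, coord_zoom = mean_coord_zoom)
```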

Refresher (3)
The goal plot from the previous exercise is coded in your editor. Here you'll expand on this plot by showing a stat_smooth() model instead of every data point.

```{r}
# Show a linear smooth per color instead of the individual points
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
  stat_smooth(method = "lm") +
  scale_x_log10(expression(log[10](Carat)), limits = c(0.1,10)) +
  scale_y_log10(expression(log[10](Price)), limits = c(100,100000)) +
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic()
```

Transformations
In this exercise you'll return to the first plotting exercise and see how box plots compare to dot plots for representing high-density data.

Box plots are very useful, but they don't solve all your problems all the time. For example, when your data are heavily skewed, you will still need to transform them. You'll see that here, using the movies_small dataset, a subset of 1,000 observations of ggplot2movies::movies.

```{r}
# movies_small is available

# Add a boxplot geom
d <- ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  geom_boxplot() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red")

# Untransformed plot
d

# Transform the scale
d + scale_y_log10()

# Transform the coordinates
d + coord_trans(y = "log10")
```
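
The two transformed plots differ because the summary statistic is computed at different stages. A base R sketch with made-up vote counts shows the effect on a mean: transforming the scale averages the log10 values, while transforming the coordinates averages the raw values and only changes how the axis is drawn.

```r
votes <- c(10, 100, 1000, 10000)

# scale_y_log10(): the mean is taken on the log10 values (back-transformed here)
mean_on_log_scale <- 10^mean(log10(votes))

# coord_trans(y = "log10"): the mean is taken on the raw values
mean_on_raw_scale <- mean(votes)

c(scale = mean_on_log_scale, coord = mean_on_raw_scale)
```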

Cut it up!
If you only have continuous variables, you can convert them into ordinal variables using any of the following functions:

cut_interval(x, n) makes n groups from vector x with equal range.
cut_number(x, n) makes n groups from vector x with (approximately) equal numbers of observations.
cut_width(x, width) makes groups of width width from vector x.
This is useful when you want to summarize a complex scatter plot like the one shown in the viewer. By applying these functions to the carat variable and mapping that onto the group aesthetic, you can convert the scatter plot in the viewer into a series of box plots on the fly.

```{r}
# Plot object p
p <- ggplot(diamonds, aes(x = carat, y = price))

# Use cut_interval
p + geom_boxplot(aes(group = cut_interval(carat, n=10)))

# Use cut_number
p + geom_boxplot(aes(group = cut_number(carat, n=10)))

# Use cut_width
p + geom_boxplot(aes(group = cut_width(carat, width = 0.25)))
```
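
To compare the three cutting functions outside of a plot, the sketch below tabulates the groups they produce for a small, deliberately skewed vector. The vector x is an illustrative assumption.

```r
library(ggplot2)

# Twelve values that double each time, so most of them sit near the low end
x <- 2^(0:11)

# Equal range: most values fall into the first group
table(cut_interval(x, n = 4))

# Equal counts: the group boundaries adapt to the data
table(cut_number(x, n = 4))

# Fixed width of 512 units per group
table(cut_width(x, width = 512))
```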

geom_density()
To make a straightforward density plot, add a geom_density() layer.

Before plotting, you will calculate the empirical density function, similar to how you can use the density() function in the stats package, available by default when you start R. The following default parameters are used (you can specify these arguments both in density() as well as geom_density()):

bw = "nrd0", telling R which rule to use to choose an appropriate bandwidth.
kernel = "gaussian", telling R to use the Gaussian kernel.
We've already prepared a data frame test_data for you, containing three columns: norm, bimodal and uniform. Each column represents 200 samples from a normal, bimodal and uniform distribution.

```{r}
rn <- rnorm(200, 0, 1)

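bimodalDistFunc <- function (n, cpct, mu1, mu2, sig1, sig2) {
  # Draw from two log-normal components (rlnorm() takes meanlog and sdlog)
  y0 <- rlnorm(n, meanlog = mu1, sdlog = sig1)
  y1 <- rlnorm(n, meanlog = mu2, sdlog = sig2)

  # Pick the second component with probability cpct and return the mixture
  flag <- rbinom(n, size = 1, prob = cpct)
  y0 * (1 - flag) + y1 * flag
}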

bm <- bimodalDistFunc(n=200,0.4,-1,1, 1,1)
ud <- runif(200, -2, 1)
test_data <- data.frame("norm" = rn,
                        "bimodal" = bm,
                        "uniform" = ud)
head(test_data)

```

```{r}
# test_data is available

# Calculating density: d
d <- density(test_data$norm)

# Use which.max() to calculate mode
mode <- d$x[which.max(d$y)]

# Finish the ggplot call
ggplot(test_data, aes(x = norm)) +
  geom_rug() +
  geom_density() +
  geom_vline(xintercept = mode, col = "red")
```
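
A short base R check of the defaults named above: the bandwidth stored in a density() result should match what the "nrd0" rule gives for the same data. The seed and sample are arbitrary assumptions.

```r
set.seed(42)
x <- rnorm(200)

# Defaults: bw = "nrd0", kernel = "gaussian"
d <- density(x)

# Bandwidth actually used, next to the rule-of-thumb value computed directly
d$bw
bw.nrd0(x)

# Location of the highest estimated density (the mode estimate)
d$x[which.max(d$y)]
```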

Combine density plots and histogram
Sometimes it is useful to compare a histogram with a density plot. However, the histogram's y-scale must first be converted to density instead of absolute count. After doing so, you can add an empirical PDF using geom_density() or a theoretical PDF using stat_function().

Can you finish the plot below by following the steps?

```{r}
# test_data is available

# Arguments you'll need later on
fun_args <- list(mean = mean(test_data$norm), sd = sd(test_data$norm))

# Finish the ggplot
ggplot(test_data, aes(x = norm)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density(col = "red") +
  stat_function(fun = dnorm, args = fun_args, col = "blue")
```
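
Why the density scale makes the histogram comparable to a PDF can be checked with base R's hist(): on that scale the bar areas sum to one, whereas the counts sum to the sample size. The seed and sample are arbitrary assumptions.

```r
set.seed(42)
x <- rnorm(200)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

# Counts add up to the number of observations
sum(h$counts)

# Density times bin width adds up to 1, like the area under a PDF
sum(h$density * diff(h$breaks))
```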

Adjusting density plots
There are three parameters that you may be tempted to adjust in a density plot:

bw - the smoothing bandwidth to be used, see ?density for details
adjust - a multiplier applied to the bandwidth, see ?density for details
kernel - kernel used for density estimation, defined as
"g" = gaussian
"r" = rectangular
"t" = triangular
"e" = epanechnikov
"b" = biweight
"c" = cosine
"o" = optcosine
In this exercise you'll use a dataset containing only four points, small_data, so that you can see how these three arguments affect the shape of the density plot.

The vector get_bw contains the bandwidth that is used by default in geom_density(). p is a basic plotting object that you can start from.

```{r}
# small_data is available
small_data <- data.frame("x" = c(-3.5, 0.0,0.5, 6.0))

# Get the bandwidth
get_bw <- density(small_data$x)$bw

# Basic plotting object
p <- ggplot(small_data, aes(x = x)) +
  geom_rug() +
  coord_cartesian(ylim = c(0,0.5))

# Create three plots
p + geom_density()
p + geom_density(adjust = 0.25)
p + geom_density(bw = 0.25 * get_bw)

# Create two plots
p + geom_density(kernel = "r")
p + geom_density(kernel = "e")
```
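
The second and third plots above should look the same, because adjust simply multiplies the bandwidth. A base R sketch with the same four points makes that explicit.

```r
x <- c(-3.5, 0.0, 0.5, 6.0)

bw_default  <- density(x)$bw
bw_adjusted <- density(x, adjust = 0.25)$bw
bw_explicit <- density(x, bw = 0.25 * bw_default)$bw

# The last two bandwidths agree, and so do the resulting curves
c(default = bw_default, adjusted = bw_adjusted, explicit = bw_explicit)
all.equal(density(x, adjust = 0.25)$y, density(x, bw = 0.25 * bw_default)$y)
```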

Box plots with varying width
A drawback of showing a box plot per group is that you have no indication of the sample size, n, in each group that went into making the plot. One way of dealing with this is to use a variable width for the box, which reflects differences in n.

Can you add some good-looking box plots to the basic plot coded below?

```{r}
# Finish the plot
ggplot(diamonds, aes(x = cut, y = price, col = color)) +
  geom_boxplot(varwidth = TRUE) +
  facet_grid(. ~ color)
```
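
With varwidth = TRUE, box widths are proportional to the square root of the number of observations in each group. The sketch below computes those relative widths by hand for a small built-in dataset. Using mtcars and the name relative_width are illustrative choices.

```r
# Observations per group
n <- table(mtcars$cyl)

# Widths relative to the largest group
relative_width <- sqrt(n) / max(sqrt(n))
round(relative_width, 2)
```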

Multiple density plots
In this exercise you'll combine multiple density plots. Here, you'll combine just two distributions, a normal and a bimodal.

The first thing to remember is that you can consider values as two separate variables, like in the test_data data frame, or as a single continuous variable with their ID as a separate categorical variable, like in the test_data2 data frame. test_data2 is more convenient for combining and comparing multiple distributions.

```{r}
test_data2 <- data.frame("dist" = c(rep("norm", 200), rep("bimodal", 200)),
                         "value" = c(test_data$norm, test_data$bimodal))

# test_data and test_data2 are available
str(test_data)
str(test_data2)

# Plot with test_data
ggplot(test_data, aes(x = norm)) +
  geom_rug()+
  geom_density()

# Plot two distributions with test_data2
ggplot(test_data2, aes(x = value, fill = dist, col = dist)) +
  geom_rug(alpha = 0.6) +
  geom_density(alpha = 0.6)

```

Multiple density plots (2)
When you looked at multiple box plots, you compared the total sleep time of various mammals, sorted according to their eating habits. One thing you noted is that for insectivores, box plots didn't really make sense, since there were only 5 observations to begin with. You decided that you could nonetheless use the width of a box plot to show the difference in sample size between the groups. Here, you'll see a similar thing with density plots.

A cleaned up version of the mammalian dataset is available as mammals.

```{r}
head(msleep)
mammals <- msleep[,c("vore","sleep_total")]
mammals
```

```{r}
# Individual densities (msleep codes insectivores as "insecti"; which() drops rows where vore is NA)
ggplot(mammals[which(mammals$vore == "insecti"), ], aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# With faceting
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3)) +
  facet_wrap( ~ vore, nrow = 2)

# Note that by default, the x ranges fill the scale
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Trim each density plot individually
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35, trim = TRUE) +
  scale_x_continuous(limits=c(0,24)) +
  coord_cartesian(ylim = c(0, 0.3))
```

Weighted density plots
When plotting a single variable, the density plots (and their bandwidths) are calculated separately for each variable (see the plot from the previous exercise, provided).

However, when you compare several variables (such as eating habits) it's useful to see the density of each subset in relation to the whole data set. This holds true for multiple density plots as well as for violin plots.

For this, we need to weight the density plots so that they're relative to each other. Each density plot is adjusted according to what proportion of the total data set each sub-group represents. We calculated this using the dplyr commands on lines 11-15.

The mammals data frame is available as before. After executing the commands, it will have the variable n, which we'll use for weighting.

```{r}
# Unweighted density plot from before
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Unweighted violin plot
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin()

# Calculate weighting measure
library(dplyr)
mammals <- mammals %>%
  group_by(vore) %>%
  mutate(n = n() / nrow(mammals))  # n: share of all rows that belong to this vore group

# Weighted density plot
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(aes(weight = n), col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Weighted violin plot
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin(aes(weight = n), col = NA)
```
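
The weighting measure itself is just each group's share of the whole data set. As a minimal sketch in base R, with a made-up toy_mammals data frame standing in for mammals:

```{r}
# Hypothetical data: 6 herbivores, 3 carnivores, 1 insectivore
toy_mammals <- data.frame(
  vore = c(rep("herbi", 6), rep("carni", 3), "insecti"),
  sleep_total = c(3, 4, 5, 9, 10, 14, 10, 12, 13, 19)
)

# Share of the whole data set contributed by each group
group_size <- table(toy_mammals$vore)
toy_mammals$n <- as.numeric(group_size[as.character(toy_mammals$vore)]) / nrow(toy_mammals)

# One row per group: the proportions add up to 1
unique(toy_mammals[c("vore", "n")])
```

Every row in a group carries the same value of n, so small groups such as the single insectivore contribute correspondingly little.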

2D density plots (1)
You can consider two orthogonal density plots in the form of a 2D density plot. Just like with a 1D density plot, you can adjust the bandwidth of both axes independently.

The data is stored in the faithful data frame, available in the datasets package. The object p contains the base definitions of a plot.

```{r}
# Base layers
p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0, 0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0, 0)) +
  coord_fixed(60 / 4.5)

# 1 - Use geom_density_2d()
p + geom_density_2d()

# 2 - Use stat_density_2d() with arguments
p + stat_density_2d(aes(col = ..level..), h = c(5, 0.5))
```
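
ggplot2's documentation describes stat_density_2d() as a 2D kernel density estimate computed with MASS::kde2d(). As a rough sketch of what the h argument controls, the estimate can be computed directly; this assumes the MASS package is installed, and the grid size of 50 is an arbitrary choice for illustration.

```{r}
# 2D kernel density estimate on a 50 x 50 grid,
# with separate bandwidths for waiting (5) and eruptions (0.5)
dens2d <- MASS::kde2d(faithful$waiting, faithful$eruptions, h = c(5, 0.5), n = 50)

# x and y hold the grid coordinates, z the estimated density at each grid point
str(dens2d)
dim(dens2d$z)
```

Larger values in h smooth more strongly along the corresponding axis.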

2D density plots (2)
Continuing with the density plots from the last exercise, here you'll explore the viridis package. This package contains multi-hue color palettes suitable for continuous variables.

The advantage of these scales is that they are designed to be perceptually uniform: equal steps in the data look like equal steps in color, and the ordering stays readable in grayscale and for viewers with common forms of color blindness. High values map to lighter colors (yellow versus dark blue), so they stand out more.

A shaded 2D density plot showing the same data as the previous exercise has been provided for you. Up to you to upgrade it!

```{r}
# Load in the viridis package
library(viridis)

# Add viridis color scale
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0,0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0,0)) +
  coord_fixed(60/4.5) +
  stat_density_2d(geom = "tile", aes(fill = ..density..), h = c(5, 0.5), contour = FALSE) +
  scale_fill_viridis()
```

Pair plots and correlation matrices
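R offers several quick-and-dirty pairs plot functions. pairs() from base graphics is available on startup, while chart.Correlation() from PerformanceAnalytics and ggpairs() from GGally require their packages to be loaded. pairs() and chart.Correlation() only take continuous variables.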

You'll be working with the iris dataset and with mtcars_fact, a version of mtcars where categorical variables have been converted into actual factor columns.

```{r}
# pairs
pairs(iris[1:4])

# chart.Correlation
library(PerformanceAnalytics)
chart.Correlation(iris[1:4])

# ggpairs
library(GGally)
ggpairs(iris[1:3])
```

Create a correlation matrix in ggplot2
Instead of using an off-the-shelf correlation matrix function, you can of course create your own plot. Just for fun, in this exercise, you'll re-create the scatterplot you see on the right. The strength of the correlation is depicted by the size and color of the points and labels.

For starters, a correlation matrix can be calculated using, for example, cor(dataframe) (if all variables are numerical). Before you can use your data frame to create your own correlation matrix plot, you'll need to get it in the right format.

In the editor, you can see the definition of cor_list(), a function that re-formats the data frame x. Here, L is used to add the points to the lower triangle of the matrix, and M is used to add the numerical values as text to the upper triangle of the matrix. With reshape2::melt(), the correlation matrices L and M are each converted into a three-column data frame: the x and y axes of the correlation matrix make up the first two columns and the corresponding correlation coefficient makes up the third column. These become the new variables "points" and "labels", which can be mapped onto the size aesthetic for the points in the lower triangle and onto the label aesthetic for the text in the upper triangle, respectively. Their values will be the same, but their positions on the plot will be symmetrical about the diagonal! Merging L and M, you have everything you need.

If you're not familiar with reshape2 - don't worry, the only reason we use that instead of tidyr is that reshape2::melt() can handle a matrix, whereas tidyr::gather() requires a data frame. At this point you just need to understand how to use the output from cor_list().

You'll first use dplyr to execute this function on the continuous variables in the iris data frame (the first four columns), but separately for each species. Please refer to the course on dplyr if you are not familiar with these functions.

Next, you'll actually plot the resulting data frame with ggplot2 functions.

```{r}
library(ggplot2)
library(reshape2)

cor_list <- function(x) {
  L <- M <- cor(x)
  
  M[lower.tri(M, diag = TRUE)] <- NA
  M <- melt(M)
  names(M)[3] <- "points"
  
  L[upper.tri(L, diag = TRUE)] <- NA
  L <- melt(L)
  names(L)[3] <- "labels"
  
  merge(M, L)
}

# Calculate xx with cor_list
library(dplyr)
xx <- iris %>%
  group_by(Species) %>%
  do(cor_list(.[1:4])) 

# Finish the plot
ggplot(xx, aes(x = Var1, y = Var2)) +
  geom_point(aes(col = points, size = abs(points)), shape = 16) +
  geom_text(aes(col = labels,  size = abs(labels), label = round(labels, 2))) +
  scale_size(range = c(0, 6)) +
  scale_color_gradient2("r", limits = c(-1, 1)) +
  scale_y_discrete("", limits = rev(levels(xx$Var1))) +
  scale_x_discrete("") +
  guides(size = "none") +
  geom_abline(slope = -1, intercept = nlevels(xx$Var1) + 1) +
  coord_fixed() +
  facet_grid(. ~ Species) +
  theme(axis.text.y = element_text(angle = 45, hjust = 1),
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_blank())
```
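
If reshape2 is not available, the matrix-to-long step can also be sketched in base R. This is an alternative to reshape2::melt(), not the method used in cor_list(); the column name r is chosen here for illustration.

```{r}
# Correlation matrix of the four continuous iris variables
cor_mat <- cor(iris[1:4])

# Base R alternative to reshape2::melt(): one row per cell of the matrix
cor_long <- as.data.frame(as.table(cor_mat), responseName = "r")
head(cor_long)
```

As with melt(), the first two columns (Var1 and Var2) identify the row and column of each cell, and the third holds the correlation coefficient.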

Proportional/stacked bar plots
Before you head over to ternary plots, let's try to make a classical proportional/stacked bar plot of a subset of the data. We'll use a stacked bar plot and the coord_flip() function to flip the x and y axes.

The data frame for the African Soil Profiles Database is available in your workspace as africa and can be found in the GSIF package. It contains three columns: Sand, Silt and Clay. A smaller version, containing only 50 observations is stored in africa_sample.

In the first course we mentioned that in the data layer, the structure of the data should reflect how you wish to plot it. For a ternary plot, you need to have three separate variables, for example, Sand, Silt and Clay in africa. However, for a proportional/stacked bar plot, you just need two. The type should be defined as three levels within a single factor variable. That is, you want tidy data.

It's also useful to maintain the site IDs as a variable within the data frame. Currently, they are stored as row names, which is poor style and not useful.

```{r}

# Explore africa
str(africa)
africa_sample <- africa[sample(nrow(africa), 50), ]
str(africa_sample)

# Add an ID column from the row.names
africa_sample$ID <- row.names(africa_sample)

# Gather africa_sample
library(tidyr)
africa_sample_tidy <- gather(africa_sample, key, value, -ID)
head(africa_sample_tidy)

# Finish the ggplot command
ggplot(africa_sample_tidy, aes(x = factor(ID), y = value, fill = key)) +
  geom_col() +
  coord_flip()
```
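
The wide-to-tidy step can be illustrated on a tiny example. The sketch below uses base R's stack() instead of tidyr::gather(), and the toy_soil values are invented for illustration.

```{r}
# Hypothetical wide soil data: one column per soil component (percentages)
toy_soil <- data.frame(
  ID   = c("site1", "site2"),
  Sand = c(60, 20),
  Silt = c(30, 50),
  Clay = c(10, 30)
)

# Stack the three component columns into key/value pairs
stacked <- stack(toy_soil[c("Sand", "Silt", "Clay")])
toy_soil_tidy <- data.frame(
  ID    = rep(toy_soil$ID, times = 3),
  key   = stacked$ind,
  value = stacked$values
)
toy_soil_tidy
```

Each site now occupies three rows, one per soil component, which is the shape a stacked bar plot with fill = key expects.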

Producing ternary plots
Ok, let's move onto ternary plots. For this you'll use the ggtern package, which provides the ggtern() function.

In contrast to what you just saw in africa_sample_tidy, the three soil properties, Sand, Silt and Clay, are not going to be located in a single variable. The distinction between wide and tidy format data was discussed in the first course and here you'll see it in action. Sometimes you need to rearrange your data for the desired plot type.

Here, you'll use the complete dataset, africa, containing three separate variables for the measures of interest: that format is perfect for a ternary plot.

```{r}
# Load ggtern
library(ggtern)

# Build ternary plot
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_point(shape=16, alpha=0.2)
```

Adjusting ternary plots
Ternary plots have been around for a while in R; you could achieve the same thing with the vcd package, which builds on Michael Friendly's work on visualizing categorical data. If you just need a quick and dirty ternary plot, that may suit you just fine. However, since ggtern is built on ggplot2, you can take advantage of all the tools available therein.

ggtern is authored by Nicholas Hamilton. More information can be found on his package website: www.ggtern.com.

The plot from the previous exercise is available twice. Can you adapt it in different ways to make different ternary density plots?

```{r}
# ggtern and ggplot2 are loaded
# Original plot:
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_point(shape = 16, alpha = 0.2)

# Plot 1
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_density_tern()

# Plot 2
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  stat_density_tern(geom = "polygon", aes(fill = ..level.., alpha = ..level..)) +
  guides(fill = "none")
```

Build the network (1)
Network data may be stored in a variety of ways.

For this example, you'll use an undirected network of romantic relationships in the TV show Mad Men: geomnet::madmen.

```{r}
# Load geomnet & examine structure of madmen
library(geomnet)
str(madmen)

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Examine structure of mmnet
str(mmnet)
```
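
What the merge does is easier to see on a tiny network. The following is a sketch with made-up toy_edges and toy_vertices data frames that mimic the structure of madmen$edges and madmen$vertices.

```{r}
# Hypothetical network: A is linked to B and C, D has no relationships
toy_edges <- data.frame(Name1 = c("A", "A"), Name2 = c("B", "C"))
toy_vertices <- data.frame(
  label  = c("A", "B", "C", "D"),
  Gender = c("female", "male", "male", "female")
)

# all = TRUE keeps vertices that never appear in Name1; their Name2 is NA
toy_net <- merge(toy_edges, toy_vertices,
                 by.x = "Name1", by.y = "label",
                 all = TRUE)
toy_net
```

The result has one row per edge plus one row for every vertex that does not start an edge, so no vertex is lost from the plot.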

Build the network (2)
Now that your data is in the correct format, you can build the actual network plot.

You'll use the geom_net() function, a ggplot layer that's in the geomnet package. The ggnetwork package is a popular alternative, but we will not discuss that here.

Can you finish the ggplot() command?

```{r}
# geomnet is pre-loaded

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Finish the ggplot command
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
  geom_net(aes(col=Gender), size=6, linewidth=1, labelon=TRUE, fontsize=3, labelcolour="black")
```

Adjusting the network
Let's clean up the network a bit. As you can see, since this is in the ggplot2 framework, you can manually adjust the scales like you have always done.

Here you're going to use another trick to remove all theme elements and make a clean network plot.

```{r}
# geomnet is pre-loaded
library(ggmap)
# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Tweak the network plot
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
  geom_net(aes(col = Gender),
           size = 6,
           linewidth = 1,
           labelon = TRUE,
           fontsize = 3,
           labelcolour = "black",
           directed = TRUE) +
  scale_color_manual(values = c("#FF69B4", "#0099ff")) +
  xlim(c(-0.05, 1.05)) +
  ggmap::theme_nothing(legend = TRUE) +
  theme(legend.key = element_blank())
```

Autoplot on linear models
R has several plotting methods for specific objects. For example, using plot() on the result of an lm() call produces four diagnostic plots that give you insight into how well the fitted model describes the data.

The ggfortify package is an all-purpose plot converter between base graphics and ggplot2 grid graphics.

You'll explore exactly what we mean by graphics and grid in chapter 4. For now, just know that if you want to use the automatic output features in the context of ggplot2, they must first be converted to a ggplot object via ggfortify. This can be important at the superficial level, for consistency in appearance, but also at a deeper level, for later combining several plots in a single graphics device.

```{r}
# Create linear model: res
res <- lm(Volume~Girth, data = trees)

# Plot res
plot(res)

# Import ggfortify and use autoplot()
library(ggfortify)
autoplot(res, ncol=2)
```
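
The diagnostic plots are built from quantities stored in the model object. As a minimal sketch, the data behind the first plot (residuals against fitted values) can be pulled out by hand; toy_fit and toy_diag are names chosen for illustration.

```{r}
# Same model as above, stored under a different name
toy_fit <- lm(Volume ~ Girth, data = trees)

# Quantities behind the first diagnostic plot (residuals vs fitted values)
toy_diag <- data.frame(
  fitted    = fitted(toy_fit),
  residuals = resid(toy_fit)
)
head(toy_diag)

# With an intercept in the model, least-squares residuals sum to (numerically) zero
sum(toy_diag$residuals)
```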

ggfortify - time series
Time series objects (class mts or ts) also have their own methods for plot(). ggfortify can also take advantage of this functionality.

In the workspace, you'll find the variable Canada (it comes from the vars package): an mts class object with four series: prod is a measure of labour productivity, e is employment, U is the unemployment rate, and rw the real wage. They are each plotted as separate series by default.

```{r}
# ggfortify and Canada are available
library(vars)
# Inspect structure of Canada
str(Canada)

# Call plot() on Canada
plot(Canada)

# Call autoplot() on Canada
autoplot(Canada)

```

Distance matrices and Multi-Dimensional Scaling (MDS)
As you can probably imagine, distance matrices (class dist) contain the measured distance between all pair-wise combinations of many points. For example, the eurodist dataset contains the distances between major European cities. dist objects lend themselves well to autoplot().

The cmdscale() function from the stats package performs Classical Multi-Dimensional Scaling and returns point coordinates as a matrix. Although autoplot() will work on this object, it will produce a heatmap and not a scatter plot. However, if any of eig = TRUE, add = TRUE or x.ret = TRUE is specified, cmdscale() will return a list instead of a matrix. In these cases, the list method for autoplot() in the ggfortify package can deal with the output. The specifics of multi-dimensional scaling are beyond the scope of this course; details on the method and these arguments can be found in the help page ?cmdscale.

```{r}
# ggfortify and eurodist are available
# Autoplot + ggplot2 tweaking
autoplot(eurodist) + 
  coord_fixed()

# Autoplot of MDS
autoplot(cmdscale(eurodist, eig = TRUE), 
         label = TRUE, 
         label.size = 3, 
         size = 0)
```
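
The difference between the two return types can be checked directly. A short sketch using only the stats and datasets packages:

```{r}
# Default call: a plain matrix of coordinates (one row per city)
mds_matrix <- cmdscale(eurodist)
class(mds_matrix)

# With eig = TRUE: a list whose points element holds the same coordinates
mds_list <- cmdscale(eurodist, eig = TRUE)
class(mds_list)
names(mds_list)
dim(mds_list$points)
```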

Plotting K-means clustering
ggfortify also supports stats::kmeans class objects. You must explicitly pass the original data to the autoplot function via the data argument, since kmeans objects don't contain the original data. The result will be automatically colored according to cluster.

Here, you'll use the iris dataset and just look at K-means clustering, although autoplot() also works on the results of other clustering functions, including cluster::clara(), cluster::fanny() and cluster::pam(), as well as on principal component analyses from stats::prcomp(). Unfortunately a discussion of these methods is beyond the scope of this course.

```{r}
# Perform clustering
iris_k <- kmeans(iris[-5], 3)

# Autoplot: color according to cluster
autoplot(iris_k, data = iris, frame = TRUE)

# Autoplot: above, plus shape according to species
autoplot(iris_k, data = iris, frame = TRUE, shape='Species')
```
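
To see why the data argument is needed, it helps to look inside a kmeans object. A small sketch; the seed value and the name toy_k are arbitrary choices for illustration.

```{r}
set.seed(1)  # k-means starts from random centers; fix the seed for reproducibility
toy_k <- kmeans(iris[-5], centers = 3)

# The fitted object stores cluster labels and centers, but not the input data
names(toy_k)

# One cluster label per row of iris, which is why autoplot() needs data = iris
length(toy_k$cluster)
table(toy_k$cluster, iris$Species)
```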

Working with maps from the maps package: USA
The easiest way to obtain map polygons is through the maps package. Unfortunately there are only a few locations available, but if your region of interest is included they are extremely convenient.

The available maps of political boundaries are:

Global: world, world2
Country: france, italy, nz, usa
USA: county, state
The maps can be accessed via ggplot2::map_data(), which converts the map into a data frame containing the variables long and lat. To draw the map, you need to use geom_polygon() which will connect the points of latitude and longitude for you.

```{r}
library(maps)
# maps, ggplot2, and ggmap are pre-loaded
# Use map_data() to create usa and inspect
usa <- map_data("usa")
str(usa)

# Build the map
ggplot(usa, aes(x = long, y = lat, group = group)) +
  geom_polygon() +
  coord_map() +
  theme_nothing()
```

The population pyramid
Animations are particularly useful for temporal or geospatial data, and they are surprisingly easy to make! Here, you simply loop over the time variable in your dataset, composing a new plot for each subset of the data. These individual images are then combined into an animated GIF file.

To show this you'll use a great animated population pyramid that was presented on the Revolutions blog. There are many more adjustments you could have made to the plot, but we'll just make a barebones version here.

```{r}
japan <- read.table("japanPOP.txt", header=TRUE)
head(japan)
```

```{r}
# Inspect structure of japan
str(japan)
library(animation)
# Finish the code inside saveGIF
saveGIF({

  # Loop through all time points
  for (i in unique(japan$time)) {

    # Subset japan: data
    data <- subset(japan, time == i)

    # Finish the ggplot command
    p <- ggplot(data, aes(x = AGE, y = POP, fill = SEX, width = 1)) +
      coord_flip() +
      geom_bar(data = data[data$SEX == "Female",], stat = "identity") +
      geom_bar(data = data[data$SEX == "Male",], stat = "identity") +
      ggtitle(i)

    print(p)

  }

}, movie.name = "pyramid.gif", interval = 0.1)
```

Arranging plots (1)
The functions in gridExtra allow you to arrange any number of plots in a variety of ways. Since you can access the legend as a separate object, that means you can also arrange multiple plots with a single legend, as shown in the viewer. This is a good alternative to faceting, since with facets it's not possible to set a different geom for each sub-plot. Here, you can combine any variety of plots and use a consistent color scale with only one legend to unify the whole image.

To do this you'll create a new arranged graphical object using grid.arrange(), which combines several pre-existing grobs. Just like grid.rect() and rectGrob(), the arrange function comes in two versions: grid.arrange() produces graphics output, which means it draws the result straight to the viewer, whereas arrangeGrob() returns a graphical object (a grob) that can be manipulated further.

In this exercise, you'll just create your objects and arrange them using grid.arrange(). In the first steps you created two basic plots, g1 and g2. In the next exercise you'll see what to do about the legend.

```{r}
# Add a theme (legend at the bottom)
g1 <- ggplot(mtcars, aes(wt, mpg, col = factor(cyl))) +
  geom_point(alpha = 0.5) +
  theme(legend.position = "bottom")

# Add a theme (no legend)
g2 <- ggplot(mtcars, aes(disp, fill = factor(cyl))) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 20) +
  theme(legend.position = "none")

# Load gridExtra
library(gridExtra)

# Call grid.arrange()
grid.arrange(g1, g2, ncol = 2)
```
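
The split between a function that draws and a function that returns an object can be sketched with the grid pair mentioned above; toy_rect is a name chosen for illustration.

```{r}
library(grid)

# rectGrob() only creates and returns the object; nothing is drawn yet
toy_rect <- rectGrob(gp = gpar(fill = "grey90"))
class(toy_rect)

# The stored grob can be drawn whenever it is needed
grid.newpage()
grid.draw(toy_rect)
```

grid.rect() would perform both steps at once, just as grid.arrange() draws immediately while arrangeGrob() hands back an object.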

Arranging plots (2)
In the previous exercise you did a bare-bones arrangement of plots, but it would be nicer if the plot looks like the one that's shown in the viewer. You can imagine that you have three panels, not two. There are two asymmetrical rows, the small second row is where the legend is, and two symmetrical columns, where the plots are.

To obtain this plot you need to extract the legend. You already saw this in previous exercises and it has already been done for you; the legend is available as my_legend. Next you need to arrange all the items appropriately.

```{r}
# ggplot2, grid and gridExtra have been loaded for you
# Definitions of g1 and g2
library("grid")
g1 <- ggplot(mtcars, aes(wt, mpg, col = factor(cyl))) +
  geom_point() +
  theme(legend.position = "bottom")

g2 <- ggplot(mtcars, aes(disp, fill = factor(cyl))) +
  geom_histogram(binwidth = 20) +
  theme(legend.position = "none")

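# Find the legend: the grob in the plot's gtable that is itself a gtable
g1_grobs <- ggplotGrob(g1)$grobs
legend_index <- which(vapply(g1_grobs, inherits, what = "gtable", logical(1)))
# Extract the legend from g1
my_legend <- g1_grobs[[legend_index]]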

# Create g1_noleg
g1_noleg <- g1 + 
    theme(legend.position = "none")

# Calculate the height: legend_height
legend_height <- sum(my_legend$heights)

# Arrange g1_noleg, g2 and my_legend
grid.arrange(g1_noleg, g2, my_legend,
             layout_matrix = matrix(c(1, 3, 2, 3), ncol = 2),
             heights = unit.c(unit(1, "npc") - legend_height, legend_height))

```
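
The layout_matrix argument is easier to read once the matrix is printed. A minimal sketch in base R:

```{r}
# Same layout matrix as in the call above; matrix() fills column by column
toy_layout <- matrix(c(1, 3, 2, 3), ncol = 2)
toy_layout
```

The first row reads 1, 2, so the first two grobs sit side by side. The second row reads 3, 3, so the third grob (the legend) spans the full width underneath.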

Base package bag plot
Before you create your own stats layer, you'll begin by understanding what a bag plot is, and how to get the data for your own plots.

For this you'll use a fake dataset called test_data, which only contains two variables. A scatter plot is shown in the viewer.

The aplpack package, which contains the bagplot() and compute.bagplot() functions, has been loaded for you.

```{r}
library(aplpack)
x <- floor(runif(60, 1700, 3700))
y <- floor(runif(60, 50, 330))
test_data <- data.frame(x = x, y = y)
test_data
```
```{r}
# Call bagplot() on test_data
bagplot(test_data)

# Call compute.bagplot on test_data, assign to bag
bag <- compute.bagplot(test_data)

# Display information
bag$hull.loop
bag$hull.bag
bag$pxy.outlier

# Highlight components
points(bag$hull.loop, col = "green", pch = 16)
points(bag$hull.bag, col = "orange", pch = 16)
points(bag$pxy.outlier, col = "purple", pch = 16)
```

Multilayer ggplot2 bag plot
The viewer shows the plot you created in the last exercise.

With our current understanding, if we wanted to make a bag plot in ggplot2, we'd take the three data frames (for the loop, bag and outliers) and add them using three separate geom layers.

Let's see how this simple solution works and in the next exercises you'll expand on this topic to make a real stats layer.

The bag and test_data objects from the previous exercise are provided. test_data contains two variables: x and y.

```{r}
# bag and test_data are available

# Create data frames from matrices
hull.loop <- data.frame(x = bag$hull.loop[,1], y = bag$hull.loop[,2])
hull.bag <- data.frame(x = bag$hull.bag[,1], y = bag$hull.bag[,2])
pxy.outlier <- data.frame(x = bag$pxy.outlier[,1], y = bag$pxy.outlier[,2])

# Finish the ggplot command
ggplot(test_data, aes(x = x,  y = y)) +
  geom_polygon(data = hull.loop, fill = "green") +
  geom_polygon(data = hull.bag, fill = "orange") +
  geom_point(data = pxy.outlier, col = "purple", pch = 16, cex = 1.5)
```

Creating ggproto functions
Now that you know where to find the statistics and how to use them in ggplot2, let's put them into the functions that will make them easier to use.

For this you'll use the ggproto object-oriented programming system - the basis of creating a new layer in ggplot2. There are four arguments for a ggproto object. The first two arguments are its name and what it inherits from (in this case Stat). Next come the required aesthetics, and then, most importantly, what the stat should do. For each group of data it receives from the data layer, what should be computed? This will simply be the calculations you performed in the previous exercise.

The ggproto object definition of StatLoop is already provided. Can you finish the implementations for the other ones?

```{r}
# ggproto for StatLoop (hull.loop)
StatLoop <- ggproto("StatLoop", Stat,
                    required_aes = c("x", "y"),
                    compute_group = function(data, scales) {
                      bag <- compute.bagplot(x = data$x, y = data$y)
                      data.frame(x = bag$hull.loop[,1], y = bag$hull.loop[,2])
                    })

# ggproto for StatBag (hull.bag)
StatBag <- ggproto("StatBag", Stat,
                   required_aes = c("x", "y"),
                   compute_group = function(data, scales) {
                     bag <- compute.bagplot(x = data$x, y = data$y)
                     data.frame(x = bag$hull.bag[,1], y = bag$hull.bag[,2])
                   })

# ggproto for StatOut (pxy.outlier)
StatOut <- ggproto("StatOut", Stat,
                   required_aes = c("x", "y"),
                   compute_group = function(data, scales) {
                     bag <- compute.bagplot(x = data$x, y = data$y)
                     data.frame(x = bag$pxy.outlier[,1], y = bag$pxy.outlier[,2])
                   })
```
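
For comparison, here is a smaller ggproto stat that does not depend on aplpack. It is adapted from the convex hull example in ggplot2's "Extending ggplot2" vignette; the names StatChull, stat_chull and toy_hull are illustrative and not part of this course's code.

```{r}
library(ggplot2)

# Stat that keeps only the points on the convex hull of each group
StatChull <- ggproto("StatChull", Stat,
                     required_aes = c("x", "y"),
                     compute_group = function(data, scales) {
                       data[chull(data$x, data$y), , drop = FALSE]
                     })

# Layer function that wraps the stat
stat_chull <- function(mapping = NULL, data = NULL, geom = "polygon",
                       position = "identity", na.rm = FALSE, show.legend = NA,
                       inherit.aes = TRUE, ...) {
  layer(
    stat = StatChull, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

toy_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_chull(fill = NA, colour = "black")

# Data computed by the stat for the second layer
toy_hull <- layer_data(toy_plot, 2)
toy_hull
```

The pattern is the same as in the bag plot stats: compute_group() receives the data for one group and returns the data frame that the geom should draw.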

Creating stat_bag()
In the previous exercise you established three ggproto objects, now you need to combine them under a new ggplot2 function that you'll call stat_bag().

Adding a stat_bag() layer will execute each of the three ggproto objects that you just created.

Your three objects are called StatLoop, StatBag, StatOut, so you'll need three layers in your stat_bag() function, which you'll make with the layer() function. When you have multiple layers, you can combine them in a list by simply calling list().

For each layer, you'll also need to specify the appropriate geom: "polygon" or "point".

The framework for the stat_bag() layer function has been provided for you.

```{r}
# StatLoop, StatBag and StatOut are available

# Combine ggproto objects in layers to build stat_bag()
stat_bag <- function(mapping = NULL, data = NULL, geom = "polygon",
                     position = "identity", na.rm = FALSE, show.legend = NA,
                     inherit.aes = TRUE, loop = FALSE, ...) {
  list(
    # StatLoop layer
    layer(
      stat = StatLoop, data = data, mapping = mapping, geom = geom, 
      position = position, show.legend = show.legend, inherit.aes = inherit.aes,
      params = list(na.rm = na.rm, alpha = 0.35, col = NA, ...)
    ),
    # StatBag layer
    layer(
      stat = StatBag, data = data, mapping = mapping, geom = geom, 
      position = position, show.legend = show.legend, inherit.aes = inherit.aes,
      params = list(na.rm = na.rm, alpha = 0.35, col = NA, ...)
    ),
    # StatOut layer
    layer(
      stat = StatOut, data = data, mapping = mapping, geom = "point", 
      position = position, show.legend = show.legend, inherit.aes = inherit.aes,
      params = list(na.rm = na.rm, alpha = 0.7, col = NA, shape = 21, ...)
    )
  )
}
```

Use stat_bag()
So far you've seen the basics for creating a new ggplot layer. It's bare-bones, but functional. You now have a working solution to the bag plot question.

The ggplot2 command that you've coded before is available. Now, let's use stat_bag() to make our plot!

```{r}
# hull.loop, hull.bag and pxy.outlier are available
# stat_bag, test_data and test_data2 are available

# Previous method
ggplot(test_data, aes(x = x,  y = y)) +
  geom_polygon(data = hull.loop, fill = "green") +
  geom_polygon(data = hull.bag, fill = "orange") +
  geom_point(data = pxy.outlier, col = "purple", pch = 16, cex = 1.5)

# stat_bag
ggplot(test_data, aes(x = x, y = y)) +
  stat_bag(fill = 'black')
```

Viewport basics (1)
To get familiar with grid graphics, you'll begin with using some grid. functions. The grid package is already loaded into your R session, so you can get started straight away!

Note: In DataCamp's learning interface, each change you make to the plot will appear as a new plot, so you can see the effect of each command.

```{r}
# Draw rectangle in null viewport
grid.rect(gp = gpar(fill = "grey90"))

# Write text in null viewport
grid.text("null viewport")

# Draw a line
grid.lines(x = c(0, 0.75), y = c(0.25, 1),
          gp = gpar(lty = 2, col = "red"))

```

Viewport basics (2)
The code from the previous exercise that populates the null viewport with some basic shapes is already available. Let's take the next step and start manipulating the stack of viewports.

```{r}
# Populate null viewport
grid.rect(gp = gpar(fill = "grey90"))
grid.text("null viewport")
grid.lines(x = c(0,0.75), y = c(0.25, 1),
           gp = gpar(lty = 2, col = "red"))

# Create new viewport: vp
vp <- viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.5, just = "center")

# Push vp
pushViewport(vp)

# Populate new viewport with rectangle
grid.rect(gp = gpar(fill = "blue"))
```
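
The viewport stack can also be inspected from code. The sketch below names the viewport so that it can be identified; the names toy_vp and "inner" are chosen for illustration.

```{r}
library(grid)
grid.newpage()

# Push a named viewport, then ask grid which viewport is active
toy_vp <- viewport(x = 0.5, y = 0.5, width = 0.5, height = 0.5, name = "inner")
pushViewport(toy_vp)
inner_name <- current.viewport()$name

# Pop it off the stack again: drawing returns to the top-level viewport
popViewport()
outer_name <- current.viewport()$name

c(inner_name, outer_name)
```

Anything drawn between pushViewport() and popViewport() is placed relative to the pushed viewport rather than the whole page.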

Build a plot from scratch (1)
Using the viewports, you can create plots, manipulating the space as needed.

In this exercise you'll establish your grid viewport and in the following exercise you'll populate it with values.

```{r}
# 1 - Create plot viewport: pvp
mar <- c(5, 4, 2, 2)
pvp <- plotViewport(mar)

# 2 - Push pvp
pushViewport(pvp)

# 3 - Add rectangle
grid.rect(gp = gpar(fill = "grey80"))

# Create data viewport: dvp
dvp <- dataViewport(xData = mtcars$wt, yData = mtcars$mpg)

# 4 - Push dvp
pushViewport(dvp)

# Add two axes
grid.xaxis()
grid.yaxis()
```


Build a plot from scratch (2)
The work you did before to build a plot from scratch is already included. Now you're ready to add the points and the appropriate labels.

```{r}
# Work from before
pushViewport(plotViewport(c(5, 4, 2, 2)))
grid.rect(gp = gpar())
pushViewport(dataViewport(xData = mtcars$wt, yData = mtcars$mpg))
grid.xaxis()
grid.yaxis()

# 1 - Add text to x axis
grid.text("Weight", y = unit(-3, "lines"))

# 2 - Add text to y axis
grid.text("MPG", x = unit(-3, "lines"), rot = 90)

# 3 - Add points
grid.points(x = mtcars$wt, y = mtcars$mpg, pch = 16)
```

Modifying a plot with grid.edit
The commands you've coded up to now to create the plot are available in the editor. The great thing about grid, in comparison to base, is that you can name the different plot elements, so that you can access them and change them later on. You can do this with the grid.edit() function. Give it a try!

```{r}
# Work from before
pushViewport(plotViewport(c(5, 4, 2, 2)))
grid.rect(gp = gpar())
pushViewport(dataViewport(xData = mtcars$wt, yData = mtcars$mpg))
grid.xaxis()
grid.yaxis()

# Work from before - add names
grid.text("Weight", y = unit(-3, "lines"), name = "xaxis")
grid.text("MPG", x = unit(-3, "lines"), rot = 90, name = "yaxis")
grid.points(x = mtcars$wt, y = mtcars$mpg, pch = 16, name = "datapoints")

# Edit "xaxis"
grid.edit("xaxis", label = "Weight (1000 lbs)")

# Edit "yaxis"
grid.edit("yaxis", label = "Miles/(US) gallon")

# Edit "datapoints"
grid.edit("datapoints", gp = (gpar(col = "#C3212766", cex = 2)))
```
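
The same kind of edit can be made on a grob object that has not been drawn, using editGrob() instead of grid.edit(). A minimal sketch, with toy_label as an illustrative name:

```{r}
library(grid)

# Create a named text grob without drawing it
toy_label <- textGrob("Weight", y = unit(-3, "lines"), name = "xaxis")

# editGrob() returns a modified copy; the original is untouched
toy_label_new <- editGrob(toy_label, label = "Weight (1000 lbs)")

toy_label$label
toy_label_new$label
```

grid.edit() works on what is already on the page and finds it by name, whereas editGrob() works on an object you hold in a variable.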

Exploring the gTable
In the previous chapter you saw graphical outputs using a variety of grid. functions. Graphical Objects, aka Grobs, are the object form of these items and can be found in your ggplot2 plots. Let's take a look at how these grobs are stored in ggplot objects.

To start, a simple plot, p, has been coded for you.

```{r}
# A simple plot p
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()

# Create gtab with ggplotGrob()
gtab <- ggplotGrob(p)

# Print out gtab
gtab

# Extract the grobs from gtab: g
g <- gtab$grobs

# Draw only the legend
legend_index <- which(vapply(g, inherits, what = "gtable", logical(1)))
grid.draw(g[[legend_index]])
```

Modifying the gTable
You can visualize the layout of a gTable object with gtable_show_layout(). In the layout plot, each segment is labelled with its position.

The legend, which you can access with g[[legend_index]], is a gTable itself, so you can also show its layout. It's perfectly possible to update this layout by adding new graphical objects, similar to what you saw in the video.

```{r}
library(gtable)
# Code from before
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()
gtab <- ggplotGrob(p)
g <- gtab$grobs
legend_index <- which(vapply(g, inherits, what = "gtable", logical(1)))
grid.draw(g[[legend_index]])

# 1 - Show layout of legend grob
gtable_show_layout(g[[legend_index]])

# Create text grob
my_text <- textGrob(label = "Motor Trend, 1974", gp = gpar(fontsize = 7, col = "gray25"))

# 2 - Use gtable_add_grob to modify original gtab
new_legend <- gtable_add_grob(gtab$grobs[[legend_index]], my_text, 3, 2)

# 3 - Update in gtab
gtab$grobs[[legend_index]] <- new_legend

# 4 - Draw gtab
grid.draw(gtab)
```

Exploring ggplot objects
ggplot objects are basically just a named list that contains the information to make the actual plot. Here you'll explore the structure of this object.

```{r}
# Simple plot p
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()

# Examine class() and names()
class(p)
names(p)

# Print the scales sub-list
p$scales$scales

# Update p
p <- p +
  scale_x_continuous("Weight (1000 lbs)", limits = c(1, 6), expand = c(0, 0)) +
  scale_y_continuous("Miles/(US) gallon", limits = c(10, 35), expand = c(0, 0))

# Print the scales sub-list
p$scales$scales
```

ggplot_build and ggplot_gtable
In the viewer we have produced a box plot of the mtcars dataset (called p) that you'll use to explore two key ggplot functions for accessing the object internals: ggplot_build() and ggplot_gtable().

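ggplot_build() is executed whenever a ggplot is displayed or saved. It takes the plot object and computes everything needed to draw it: the data for each layer, after statistics and scales have been applied, and the panel layout. ggplot_gtable() then converts that result into a gtable of grobs that grid can draw.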

```{r}
# Box plot of mtcars: p
p <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) + geom_boxplot()

# Create pbuild
pbuild <- ggplot_build(p)

# a list of 3 elements
names(pbuild)

# Print out each element in pbuild
pbuild$data
pbuild$layout  # older ggplot2 versions called this element panel
pbuild$plot

# Create gtab from pbuild
gtab <- ggplot_gtable(pbuild)

# Draw gtab
grid.draw(gtab)
```

Extracting details
In the video you saw how to change the clipping parameters of a gTable object. Here, you'll see something more practical: how to extract calculated values.

Many geoms are associated with underlying descriptive statistics, which are calculated and then plotted. In these cases you don't directly have the values that were plotted. They are stored under the hood, however, and you can access them in the result of ggplot_build(). This can be particularly useful for box plots. For example, since there are many methods for calculating Q1 and Q3, if you calculate your IQR and outliers outside of ggplot2 you may end up using a different method and get different results. Sometimes you want to have exactly the values that were plotted.

```{r}
# Box plot of mtcars: p
p <- ggplot(mtcars, aes(x = factor(cyl), y = wt)) + geom_boxplot()

# Build pdata
pdata <- ggplot_build(p)$data

# confirm that the first element of the list is a data frame
class(pdata[[1]])

# Isolate this data frame
my_df <- pdata[[1]]

# The x labels
my_df$group <- c("4", "6", "8")

# Print out specific variables
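# Select by name, since column positions differ between ggplot2 versions
my_df[c("ymin", "lower", "middle", "upper", "ymax", "outliers", "group")]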
```
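
The point about differing quartile definitions can be checked in base R. This sketch compares three common definitions of Q1 on a small vector; it does not claim which one a given ggplot2 version uses.

```{r}
toy_x <- 1:10

# Three common definitions of the first quartile
q1_type7 <- unname(quantile(toy_x, 0.25, type = 7))  # R's default
q1_type6 <- unname(quantile(toy_x, 0.25, type = 6))
q1_hinge <- fivenum(toy_x)[2]                        # Tukey's lower hinge

c(type7 = q1_type7, type6 = q1_type6, hinge = q1_hinge)
# type7 = 3.25, type6 = 2.75, hinge = 3
```

Even on ten evenly spaced values the three answers differ, which is why reading the values from ggplot_build() is the safest way to report what was actually drawn.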


